Commit 17a8578

Merge pull request #37 from Caltech-IPAC/raen/patch/neowise-add-yr11-and-addendum
IRSA-6436 Add NEOWISE year11 and addendum
2 parents f486eac + 6fb48fc commit 17a8578

2 files changed: 27 additions and 16 deletions

tutorials/parquet-catalog-demos/neowise-source-table-lightcurves.md

Lines changed: 11 additions & 5 deletions
@@ -11,6 +11,11 @@ kernelspec:
   name: python3
 ---

+An executed version of this notebook can be seen on
+[IRSA's website](https://irsa.ipac.caltech.edu/docs/notebooks/neowise-source-table-lightcurves.html).
+
++++
+
 # Make Light Curves from NEOWISE Single-exposure Source Table

 +++
@@ -31,7 +36,7 @@ Learning Goals:
 This notebook loads light curves from the
 [NEOWISE](https://irsa.ipac.caltech.edu/Missions/wise.html) Single-exposure Source Table
 for a sample of about 2000 cataclysmic variables from [Downes et al. (2001)](https://doi.org/10.1086/320802).
-The NEOWISE Single-exposure Source Table is a very large catalog -- 10 years and 40 terabytes in total
+The NEOWISE Single-exposure Source Table is a very large catalog -- 11 years and 42 terabytes in total
 with 145 columns and 200 billion rows.
 When searching this catalog, it is important to consider the requirements of your use case and
 the format of this dataset.
@@ -57,7 +62,7 @@ The specific strategy we employ is:

 The efficiency of this method will increase with the number of rows needed from each partition.
 For example, a cone search radius of 1 arcsec will require about 10 CPUs, 65G RAM, and
-50 minutes to load the data from all 10 NEOWISE years.
+50 minutes to load the data from all 11 NEOWISE years.
 Increasing the radius to 10 arcsec will return about 2.5x more rows using roughly the same resources.
 Increasing the target sample size can result in similar efficiency gains.
 To try out this notebook with fewer resources, use a subset of NEOWISE years.
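
To make the "subset of NEOWISE years" advice concrete: as of this commit the notebook identifies each yearly dataset by a string label (`year1` through `year11`, plus `addendum`) and interpolates that label directly into the S3 metadata path. The sketch below reuses the bucket, prefix, and path template shown in this diff. The function `metadata_path` mirrors the notebook's lambda, while `subset` is a hypothetical choice made only for illustration. The code builds strings and does not contact S3.

```python
# Bucket and prefix as they appear in the notebook.
bucket = "nasa-irsa-wise"
base_prefix = "wise/neowiser/catalogs/p1bs_psd/healpix_k5"

# Full label list after this commit: eleven yearly datasets plus the addendum.
YEARS = [f"year{yr}" for yr in range(1, 12)] + ["addendum"]


def metadata_path(label: str) -> str:
    """Return the _metadata path for one dataset label, e.g. "year3" or "addendum"."""
    return f"{bucket}/{base_prefix}/{label}/neowiser-healpix_k5-{label}.parquet/_metadata"


# Hypothetical subset for a quick exploratory run (not taken from the notebook).
subset = ["year1", "year11", "addendum"]
subset_paths = [metadata_path(label) for label in subset]

for path in subset_paths:
    print(path)
```

Because the label is used verbatim, the addendum needs no special case: it follows the same directory and file naming pattern as the yearly datasets.
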
@@ -104,7 +109,8 @@ Real use cases are likely to require all ten years but it can be helpful to star
 fewer while exploring to make things run faster.

 ```{code-cell} ipython3
-YEARS = list(range(1, 11)) # all years => about 11 CPU, 65G RAM, and 50 minutes runtime
+# all years => about 11 CPU, 65G RAM, and 50 minutes runtime
+YEARS = [f"year{yr}" for yr in range(1, 12)] + ["addendum"]

 # To try out a smaller version of the notebook,
 # uncomment the next line and choose your own subset of years.
@@ -136,7 +142,7 @@ We'll load it as a pyarrow dataset.
 bucket = "nasa-irsa-wise"
 base_prefix = "wise/neowiser/catalogs/p1bs_psd/healpix_k5"
 metadata_path = (
-    lambda yr: f"{bucket}/{base_prefix}/year{yr}/neowiser-healpix_k5-year{yr}.parquet/_metadata"
+    lambda yr: f"{bucket}/{base_prefix}/{yr}/neowiser-healpix_k5-{yr}.parquet/_metadata"
 )
 fs = pyarrow.fs.S3FileSystem(region="us-west-2", anonymous=True)

@@ -461,6 +467,6 @@ This has to do with differences in what does / does not get copied into the chil

 **Author:** Troy Raen (IRSA Developer) and the IPAC Science Platform team

-**Updated:** 2024-08-08
+**Updated:** 2025-03-07

 **Contact:** [the IRSA Helpdesk](https://irsa.ipac.caltech.edu/docs/help_desk.html) with questions or reporting problems.

tutorials/parquet-catalog-demos/neowise-source-table-strategies.md

Lines changed: 16 additions & 11 deletions
@@ -11,14 +11,19 @@ kernelspec:
   name: python3
 ---

+An executed version of this notebook can be seen on
+[IRSA's website](https://irsa.ipac.caltech.edu/docs/notebooks/neowise-source-table-strategies.html).
+
++++
+
 # Strategies to Efficiently Work with NEOWISE Single-exposure Source Table in Parquet

 +++

 This notebook discusses strategies for working with the Apache Parquet version of the
 [NEOWISE](https://irsa.ipac.caltech.edu/Missions/wise.html) Single-exposure Source Table
 and provides the basic code needed for each approach.
-This is a very large catalog -- 10 years and 40 terabytes in total with 145 columns and 200 billion rows.
+This is a very large catalog -- 11 years and 42 terabytes in total with 145 columns and 200 billion rows.
 Most of the work shown in this notebook is how to efficiently deal with so much data.

 Learning Goals:
@@ -34,7 +39,7 @@ Learning Goals:

 +++

-The NEOWISE Single-exposure Source Table comprises 10 years of data.
+The NEOWISE Single-exposure Source Table comprises 11 years of data.
 Each year on its own would be considered "large" compared to astronomy catalogs produced
 contemporaneously, so working with the full dataset requires extra consideration.
 In this Parquet version, each year is stored as an independent Parquet dataset.
@@ -139,11 +144,11 @@ Expect the notebook to require about 4G RAM and 1 minute of runtime per year.

 ```{code-cell} ipython3
 # All NEOWISE years => about 40G RAM and 10 minutes runtime
-YEARS = list(range(1, 11))
+YEARS = [f"year{yr}" for yr in range(1, 12)] + ["addendum"]

 # To reduce the needed RAM or runtime, uncomment the next line and choose your own years.
-# Years 1 and 9 are needed for the median_file and biggest_file (defined below).
-# YEARS = [1, 9]
+# Years 1 and 8 are needed for the median_file and biggest_file (defined below).
+# YEARS = [1, 8]
 ```

 Column and partition variables:
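
One thing to watch in the commented-out subset above: after this commit `neowise_path` interpolates its `year` argument verbatim, so integer entries such as `[1, 8]` would produce a directory named `1` rather than `year1`, which does not match the `"year8"` and `"year1"` labels the commit uses elsewhere. The sketch below is one possible way to guard against that, not the notebook's own code. `as_label` is a hypothetical helper, and this `neowise_path` is a variant of the function in the diff that calls it.

```python
def as_label(year) -> str:
    """Normalize an integer year or an existing string label to the string form.

    Hypothetical helper: 8 -> "year8", "year8" -> "year8", "addendum" -> "addendum".
    """
    if isinstance(year, int):
        return f"year{year}"
    return str(year)


def neowise_path(year, file="_metadata"):
    """Variant of the notebook's path helper that accepts integers or labels."""
    bucket = "nasa-irsa-wise"
    base_prefix = "wise/neowiser/catalogs/p1bs_psd/healpix_k5"
    label = as_label(year)
    root_dir = f"{bucket}/{base_prefix}/{label}/neowiser-healpix_k5-{label}.parquet"
    return f"{root_dir}/{file}"


# Both spellings resolve to the same path.
print(neowise_path(8))
print(neowise_path("year8"))
```

A simpler alternative is to write the subset with string labels directly, for example `YEARS = ["year1", "year8"]`.
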
@@ -181,20 +186,20 @@ def neowise_path(year, file="_metadata"):
     # This information can be found at https://irsa.ipac.caltech.edu/cloud_access/.
     bucket = "nasa-irsa-wise"
     base_prefix = "wise/neowiser/catalogs/p1bs_psd/healpix_k5"
-    root_dir = f"{bucket}/{base_prefix}/year{year}/neowiser-healpix_k5-year{year}.parquet"
+    root_dir = f"{bucket}/{base_prefix}/{year}/neowiser-healpix_k5-{year}.parquet"
     return f"{root_dir}/{file}"
 ```

 Some representative partitions and files (see dataset stats in the Appendix for how we determine these values):

 ```{code-cell} ipython3
 # pixel index of the median partition and the biggest partition by number of rows
-median_part = 10_936
+median_part = 11_831
 biggest_part = 8_277

 # path to the median file and the biggest file by file size on disk (see Appendix)
-median_file = neowise_path(9, "healpix_k0=3/healpix_k5=3420/part0.snappy.parquet")
-biggest_file = neowise_path(1, "healpix_k0=2/healpix_k5=2551/part0.snappy.parquet")
+median_file = neowise_path("year8", "healpix_k0=1/healpix_k5=1986/part0.snappy.parquet")
+biggest_file = neowise_path("year1", "healpix_k0=2/healpix_k5=2551/part0.snappy.parquet")
 ```

 Convenience function for displaying a table size:
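
The partition directories in the file paths above are consistent with nested HEALPix indexing, in which an order-5 pixel maps to its order-0 parent by dropping ten bits (integer division by 4^5 = 1024). The sketch below assumes nested ordering, which this diff does not state explicitly. `k0_from_k5` and `partition_dir` are hypothetical helpers, and the `part0.snappy.parquet` file name is copied from the examples rather than guaranteed for every partition.

```python
def k0_from_k5(k5: int) -> int:
    """Order-0 parent of an order-5 pixel, assuming nested HEALPix ordering."""
    return k5 >> 10  # 2 bits per order, 5 orders between k5 and k0


def partition_dir(k5: int) -> str:
    """Hive-style partition directory in the form used by the file paths."""
    return f"healpix_k0={k0_from_k5(k5)}/healpix_k5={k5}"


# The three order-5 pixels that appear in this diff.
for k5 in (1986, 2551, 3420):
    print(f"{partition_dir(k5)}/part0.snappy.parquet")
```

Under that assumption, the three pixels reproduce the `healpix_k0` values 1, 2, and 3 seen in the old and new `median_file` and `biggest_file` paths.
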
@@ -400,7 +405,7 @@ for year, year_ds in zip(YEARS, neowise_ds.children):
     # we'll just look at some basic metadata.
     num_rows = sum(frag.metadata.num_rows for frag in year_ds.get_fragments())
     num_files = len(year_ds.files)
-    print(f"NEOWISE year {year} dataset: {num_rows:,} rows in {num_files:,} files")
+    print(f"NEOWISE {year} dataset: {num_rows:,} rows in {num_files:,} files")
 ```

 ## Appendix
@@ -559,6 +564,6 @@ per_part.sort_values("numrows").iloc[len(per_part.index) // 2]

 **Author:** Troy Raen (IRSA Developer) and the IPAC Science Platform team

-**Updated:** 2024-08-08
+**Updated:** 2025-03-07

 **Contact:** [the IRSA Helpdesk](https://irsa.ipac.caltech.edu/docs/help_desk.html) with questions or reporting problems.
