Commit b88c696

Remove separation distance filtering on crossmatched catalog
1 parent 668a59a commit b88c696

1 file changed: +20 -48 lines changed

tutorials/parquet-catalog-demos/irsa-hats-with-lsdb.md

Lines changed: 20 additions & 48 deletions
@@ -424,60 +424,32 @@ with Client(n_workers=get_nworkers(euclid_x_ztf),
 euclid_x_ztf_df
 ```
 
-### 5.3 [Optional] Filter the crossmatched catalog
-
-Let's purify the crossmatched catalog by analyzing the distance between matched sources and removing the matches that don't meet a quality cut on percentile.
-We also keep the matches that are outside this cutoff but are still within the same 19th order HEALPix tile.
-
-```{code-cell} ipython3
-euclid_x_ztf_df['_dist_arcsec'].describe()
-```
-
-```{code-cell} ipython3
-euclid_x_ztf_filtered_df = euclid_x_ztf_df[
-    (euclid_x_ztf_df['_dist_arcsec'] < euclid_x_ztf_df['_dist_arcsec'].quantile(0.75)) # keep matches within 75th percentile
-    | (euclid_x_ztf_df['_healpix_19_euclid'] == euclid_x_ztf_df['_healpix_19_ztf']) # also include exact 19th order healpix matches
-].sort_values('_dist_arcsec')
-euclid_x_ztf_filtered_df
-```
-
-```{code-cell} ipython3
-bins = np.histogram_bin_edges(euclid_x_ztf_df['_dist_arcsec'], bins=100)
-plt.hist(euclid_x_ztf_df['_dist_arcsec'], bins=bins, alpha=0.8, label='All matches <=1"')
-plt.hist(euclid_x_ztf_filtered_df['_dist_arcsec'], bins=bins, alpha=0.8, label='Filtered matches')
-plt.axvline(euclid_x_ztf_df['_dist_arcsec'].quantile(0.75), color='black', linestyle='dashed', linewidth=1, label='75th percentile')
-plt.xlabel('Distance (arcsec)')
-plt.ylabel('Count')
-plt.title('Separation between Euclid and ZTF cross-matched sources')
-plt.legend()
-plt.show()
-```
-
-Going forward, we will use this purified crossmatched catalog `euclid_x_ztf_filtered_df` for analysis.
-
-+++
-
-### 5.4 Identify objects of interest from the crossmatch
+### 5.3 Identify objects of interest from the crossmatch
 
 +++
 
 Check the number of unique Euclid and ZTF sources in the crossmatched catalog:
 
 ```{code-cell} ipython3
-euclid_x_ztf_filtered_df.shape[0], euclid_x_ztf_filtered_df['object_id_euclid'].nunique(), euclid_x_ztf_filtered_df['oid_ztf'].nunique()
+euclid_x_ztf_df.shape[0], euclid_x_ztf_df['object_id_euclid'].nunique(), euclid_x_ztf_df['oid_ztf'].nunique()
 ```
 
 This means there is one unique Euclid source for each row in the crossmatched catalog as expected (since we put Euclid on the left side of the crossmatch).
-But for ZTF, this is also true, i.e., no ZTF object has multiple Euclid matches within our constraints.
-
-Check if there is any ZTF object that has observations in multiple filters:
+But for ZTF, this is not true as some ZTF objects have multiple Euclid matches since ZTF has lower resolution than Euclid.
+Let's identify such cases:
 
 ```{code-cell} ipython3
-multi_filter_oids = euclid_x_ztf_filtered_df.groupby('oid_ztf')['fid_ztf'].nunique()
-multi_filter_oids
+many_euclid_x_one_ztf_df = euclid_x_ztf_df[
+    # more than one Euclid object matched to the same ZTF object
+    euclid_x_ztf_df.groupby('oid_ztf')['object_id_euclid'].transform('nunique') > 1
+].sort_values('oid_ztf')
+many_euclid_x_one_ztf_df[['object_id_euclid', 'oid_ztf', 'filtercode_ztf', '_dist_arcsec']]
 ```
 
+Let's also check if there is any ZTF object that has observations in multiple filters as it may warrant special handling:
+
 ```{code-cell} ipython3
+multi_filter_oids = euclid_x_ztf_df.groupby('oid_ztf')['fid_ztf'].nunique()
 multi_filter_oids[multi_filter_oids > 1].size
 ```
 
@@ -487,7 +459,7 @@ Now let's plot some variability metrics from ZTF against Euclid redshift to see
 We will use hexbin plots to visualize the density of sources in each panel:
 
 ```{code-cell} ipython3
-z = euclid_x_ztf_filtered_df["phz_phz_median_euclid"].to_numpy() # x-axis
+z = euclid_x_ztf_df["phz_phz_median_euclid"].to_numpy() # x-axis
 metrics = [ # y-axes
     ("magrms_ztf", "ZTF mag RMS"),
     ("chisq_ztf", "ZTF χ²"),
@@ -503,7 +475,7 @@ gridsize = 48 # resolution: larger => finer grid
 
 for i, (col, ylabel) in enumerate(metrics):
     ax = axes[i]
-    y = euclid_x_ztf_filtered_df[col].to_numpy()
+    y = euclid_x_ztf_df[col].to_numpy()
 
     # clip y to robust range (1–99th percentile) for visibility
     y_lo, y_hi = np.nanpercentile(y, [1, 99])
@@ -542,25 +514,25 @@ Despite this, we can still select some high-variability galaxy sources from the
 Let's focus only on Chi-squared (measure of significance) and RMS magnitude (measure of variability amplitude; similar to MAD) metrics:
 
 ```{code-cell} ipython3
-euclid_x_ztf_filtered_df['chisq_ztf'].describe()
+euclid_x_ztf_df['chisq_ztf'].describe()
 ```
 
 ```{code-cell} ipython3
-chisq_threshold = euclid_x_ztf_filtered_df['chisq_ztf'].quantile(0.95)
+chisq_threshold = euclid_x_ztf_df['chisq_ztf'].quantile(0.95)
 chisq_threshold
 ```
 
 ```{code-cell} ipython3
-euclid_x_ztf_filtered_df['magrms_ztf'].describe()
+euclid_x_ztf_df['magrms_ztf'].describe()
 ```
 
 ```{code-cell} ipython3
-magrms_threshold = euclid_x_ztf_filtered_df['magrms_ztf'].quantile(0.95)
+magrms_threshold = euclid_x_ztf_df['magrms_ztf'].quantile(0.95)
 magrms_threshold
 ```
 
 ```{code-cell} ipython3
-variable_galaxies = euclid_x_ztf_filtered_df.query(
+variable_galaxies = euclid_x_ztf_df.query(
     f"chisq_ztf >= {chisq_threshold} & magrms_ztf >= {magrms_threshold}"
 ).sort_values("chisq_ztf", ascending=False) # sort by significant variability
 ```
@@ -670,7 +642,7 @@ plt.show()
 
 ## About this notebook
 
-Author: Jaladh Singhal, Troy Raen, and the IRSA Data Science Team
+Author: Jaladh Singhal, Troy Raen, Jessica Krick, Brigitta Sipőcz, and the IRSA Data Science Team
 
 Updated: 2025-09-15
 
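
Note on the many-to-one check added in section 5.3: the pattern `groupby(...).transform('nunique') > 1` builds a row-aligned boolean mask, so it can index the original DataFrame directly. The following is a minimal sketch on toy data and is not part of the commit. The column names follow the diff, but the values in `toy_df` and the names `n_euclid_per_ztf` and `many_to_one_df` are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the crossmatch output (values are made up, not real catalog rows).
toy_df = pd.DataFrame({
    "object_id_euclid": [101, 102, 103, 104, 105],
    "oid_ztf": [1, 1, 2, 3, 3],
    "_dist_arcsec": [0.12, 0.85, 0.30, 0.05, 0.40],
})

# For every row, count the distinct Euclid objects that share that row's ZTF object.
# transform() returns a Series aligned with toy_df's index, unlike a plain aggregation.
n_euclid_per_ztf = toy_df.groupby("oid_ztf")["object_id_euclid"].transform("nunique")

# Keep only rows whose ZTF object is matched by more than one Euclid object.
many_to_one_df = toy_df[n_euclid_per_ztf > 1].sort_values("oid_ztf")
print(many_to_one_df)
```

In this toy table, ZTF objects 1 and 3 each have two Euclid counterparts and are kept, while ZTF object 2 has a single counterpart and is dropped.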