Skip to content

Conversation

@jaladh-singhal
Copy link
Member

@jaladh-singhal jaladh-singhal commented Sep 14, 2025

Addresses IRSA-7104

NOTE: The time performance of the notebook depends on the spatial filters defined in the beginning. I'm working with a cone at Euclid Deep Field North center, with following radius size:

cone size notebook executed where total execution time comments
0.5 deg locally 4.5-5 min ~2 min for each of two lsdb .compute() calls (was 4+4min on Friday)
0.5 deg CircleCI rendering 101s = 1.68min (average of two runs) surprisingly less! I was expecting it to be more than local. Is it because CircleCI is also on AWS or something else?
0.5 deg Fornax SP 75s = 1.25min fastest of all as expected
3 deg (entire field) Fornax SP 192s = 3.2 min (1.5min for 1st compute()) 6 times more cone radius is still faster than locally! Crossmatch results include more interesting candidates

@jaladh-singhal jaladh-singhal self-assigned this Sep 14, 2025
@jaladh-singhal jaladh-singhal added the content Content related issues/PRs. label Sep 14, 2025
Copy link
Contributor

@troyraen troyraen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll come back and look in more detail, but here's a couple of things I noticed quickly.

Will you make the markdown text one line per sentence? We try to stick with that in these repos because it makes the reviews a lot easier.

@troyraen troyraen self-requested a review September 15, 2025 17:23
@bsipocz
Copy link
Member

bsipocz commented Sep 16, 2025

@jaladh-singhal - could you roll back your last commit and break the Markdown at the end of the sentences? Breaking them at char lenght will result way bigger than necessary diffs in the future --> the markdowns we thus break at the end of the sentences. (and a good rule of thumb is to write shorter sentences if a line becomes really long :) )

Copy link
Contributor

@jkrick jkrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said in slack, I think this is a super useful notebook, so thanks for writing it. I have a few comments below, but traditionally this repo doesn't get quite the scrutiny that fornax-demo-notebooks does, so please take these as suggestions.

I think this title does work, but to be nitpicky, I would like to also see the words "Euclid" and "ZTF" in there to help users wanting those particular datasets. Some suggestions:
"Working with parquet Collections: Euclid Q1 & ZTF DR23 via lsdb" or "HATS-Based Analysis of Euclid and ZTF Catalogs" or "Using lsdb to Filter and Match Euclid and ZTF Cloud Catalogs"

split section 5.2 into two sections, 5.2 Execute the query and 5.3 filter data frame

My biggest comment is that the notebook feels very long to me. Sections I think could be cut:

  • don't need to see the whole euclid_schema_df, so no need to print it out in full in section 3.2
  • I think one filtering example on the column names would be enough, so suggest to keep: "euclid_schema_df[
    euclid_schema_df["name"].str.startswith("phz_")
    & euclid_schema_df["type"].str.contains("int") # to see flag type columns
    ]" and remove "euclid_schema_df[euclid_schema_df["name"].str.startswith("phz_")] # phz_ prefix is for PHZ catalog columns in this merged catalog"
  • section 4 can be shortened because it is really a repeat of section 3 with replacing "Euclid" with "ZTF", so I would suggest limiting to just what we need for the rest of the notebook.
  • section 5.3, suggest remove the parts about "check if there is any ZTF object that has observations in multiple filters", go straight to the plots
  • I don't think the number of galaxies as a function of redshift is an important plot since much of this information can be gleaned from the hex bin plots above.

And I put these in slack before I knew I was doing a full review, so here are the requested comments on the specific filters, copied here so we can track them:

  • In section 5.2, not sure why you keep those sources within the same healpix area but outside of 75percentile.
  • I wonder if you limited to redshifts below 1 if you would be able to see trends in the hexbin plots as a function of redshift better. I would probably filter on redshift when you do the initial filtering with "galaxy_filters"
  • what is the magnitude limit of ZTF in r band? Those data points in the early times with the larger error bars appear to be pulling up the RMS mag and other metrics. I wonder if they are good data points or not. For the light curve notebook we wrote a while ago we filtered on "catflags < 32768" to remove bad flags, can we do that with this notebook as well?

@jkrick
Copy link
Contributor

jkrick commented Sep 18, 2025

I just tried running this on fornax, and am getting the following AWS access error on the first cell in section 3.2 which calls pq.read_schema on the euclid path.


OSError Traceback (most recent call last)
Cell In[20], line 1
----> 1 euclid_schema = pq.read_schema(
2 f"s3://{euclid_q1_bucket}/{euclid_q1_hats_prefix}/{euclid_q1_schema_path}"
3 )
4 type(euclid_schema)

File /opt/envs/python3/lib/python3.12/site-packages/pyarrow/parquet/core.py:2389, in read_schema(where, memory_map, decryption_properties, filesystem)
2387 file_ctx = nullcontext()
2388 if filesystem is not None:
-> 2389 file_ctx = where = filesystem.open_input_file(where)
2391 with file_ctx:
2392 file = ParquetFile(
2393 where, memory_map=memory_map,
2394 decryption_properties=decryption_properties)

File /opt/envs/python3/lib/python3.12/site-packages/pyarrow/_fs.pyx:814, in pyarrow._fs.FileSystem.open_input_file()

File /opt/envs/python3/lib/python3.12/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/envs/python3/lib/python3.12/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

OSError: When reading information for key 'contributed/q1/merged_objects/hats/euclid_q1_merged_objects-hats/dataset/_common_metadata' in bucket 'nasa-irsa-euclid-q1': AWS Error ACCESS_DENIED during HeadObject operation: No response body.

@jaladh-singhal
Copy link
Member Author

I just tried running this on fornax, and am getting the following AWS access error on the first cell in section 3.2 which calls pq.read_schema on the euclid path.

@jkrick good catch. I also ran it at Fornax and ran into this problem. Will push a fix soon to PR so that it works at Fornax as well as here in CI.

@jaladh-singhal
Copy link
Member Author

@jkrick you can try it now - pushed an update and tested it on Fornax myself, it works! Note: you might wanna increase cone radius, adjust plot X/Y bounds, etc. on Fornax

@bsipocz
Copy link
Member

bsipocz commented Sep 18, 2025

but traditionally this repo doesn't get quite the scrutiny that fornax-demo-notebooks does

@jkrick - I like to think the other way around, I'm more certain that these notebooks will work out of the box. But I see what you may mean about the content -- and for that I think we should be very open for incoming improvements, or better document that these are showcases to focus on the data rather than the scientific process. If you have any good description that makes this approach clear I think we should add that to the index page

@bsipocz bsipocz force-pushed the IRSA-7104-hats-lsdb branch from bed117a to a82d725 Compare September 18, 2025 22:31
@bsipocz
Copy link
Member

bsipocz commented Sep 23, 2025

Just a heads-up, I'm kind of hijacking this PR to do some caching experiment, too. --> ignore the circleCI statuses and/or if you see any strange errors for circleCI

@bsipocz bsipocz force-pushed the IRSA-7104-hats-lsdb branch from 7846e3c to ba60bd8 Compare September 23, 2025 20:21
@bsipocz
Copy link
Member

bsipocz commented Sep 26, 2025

This one needs a careful rebase now as we switch the system. Let me know if you rather have me doing it.

@jaladh-singhal
Copy link
Member Author

This one needs a careful rebase now as we switch the system. Let me know if you rather have me doing it.

@bsipocz sure go for it!

@bsipocz
Copy link
Member

bsipocz commented Sep 26, 2025

OK, so question: which new subject category this should go into?

@bsipocz bsipocz force-pushed the IRSA-7104-hats-lsdb branch from 24b6c24 to d8a7dab Compare September 26, 2025 23:00
@bsipocz
Copy link
Member

bsipocz commented Sep 26, 2025

Right now it's in "Special Topics" -- I feel that should not "other" the cloud and parquet/hats stuff, but probably also should not hide it underneath one of the surveys. Maybe "big data astronomy?"

@bsipocz bsipocz added the GHA buildhtml Enable extra buildhtml job on GHA label Sep 26, 2025
@bsipocz bsipocz closed this Sep 26, 2025
@bsipocz bsipocz reopened this Sep 26, 2025
@bsipocz
Copy link
Member

bsipocz commented Sep 27, 2025

Closed and reopened to trigger the GHA docs build, too. I do that to make sure we don't run out of resources for deployment (that is not an issue on circleCI as a bunch of other notebooks are excluded, but I noticed that this one piked up to 70%)

Copy link
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rendering looks good to me, testing, too, so I would say let's go ahead and merge this.

And I have added the GHA html build job, too to ensure this would not make that build run out of resources. IMO we should do add that extra job when we add a new notebook or do significant (performance affecting) changes.

@jaladh-singhal
Copy link
Member Author

@bsipocz - I still need to address some of the science feedback I got from Jessica, Harry/Anhita so I'll let you know when it's ready.

Thanks for the infra updates and making it ready!

@bsipocz
Copy link
Member

bsipocz commented Sep 29, 2025

I still need to address some of the science feedback I got from Jessica, Harry/Anhita so I'll let you know when it's ready.

That's all good, I don't need to sign off on this again unless you feel you made significant infra choices or want me to double check something.

Copy link
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jaladh-singhal - I keep seeing this error in rendering: https://github.com/Caltech-IPAC/irsa-tutorials/actions/runs/18145382851/job/51645786712?pr=136#step:5:291

It's really puzzling that it doesn't show up as a test failure with pytest, could you have a look at it though?

In the meantime I'm blocking this PR from merging as we need to understand what is going on.

@jaladh-singhal
Copy link
Member Author

jaladh-singhal commented Sep 30, 2025

@bsipocz I'll take a look - how's "buildhtml testing" job different than pytest ones and circle ci rendering? I mean in terms of its purpose because the other jobs execute the "id search" just fine.

@bsipocz
Copy link
Member

bsipocz commented Sep 30, 2025

and circle ci rendering

It's known ante-feature that it doesn't have a failing status when there is a failing notebook in it: it you look into the build log the traceback is there

@jaladh-singhal
Copy link
Member Author

jaladh-singhal commented Oct 1, 2025

@jaladh-singhal - I keep seeing this error in rendering: https://github.com/Caltech-IPAC/irsa-tutorials/actions/runs/18145382851/job/51645786712?pr=136#step:5:291

It's really puzzling that it doesn't show up as a test failure with pytest, could you have a look at it though?

In the meantime I'm blocking this PR from merging as we need to understand what is going on.

The TypeError was happening since the lsdb->hats->pyarrow machinery was picking newest universal_pathlib v0.3 released yesterday which changed the object/type. Pinning it for now until they resolve the typing mismatch (opened issue upstream: astronomy-commons/lsdb#1047).

@bsipocz the buildhtml job now passes for my notebook, it's failing for unrelated reason (some DOI is not found)

@troyraen
Copy link
Contributor

troyraen commented Oct 1, 2025

I'm in the middle of reviewing this. Please don't merge yet.

@bsipocz
Copy link
Member

bsipocz commented Oct 1, 2025

The DOIs are not expected to fail the build, and I'm dealing with the making sure tracebacks will fail the job in a separate PR.

Copy link
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more comment, then this is good to go from my end.

Copy link
Contributor

@troyraen troyraen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jaladh-singhal!

@bsipocz bsipocz removed the GHA buildhtml Enable extra buildhtml job on GHA label Oct 1, 2025
@troyraen
Copy link
Contributor

troyraen commented Oct 2, 2025

Thanks @jaladh-singhal this looks great! And I just ran it on Fornax and it all went well. Merging now.

@troyraen troyraen merged commit d4cfd94 into Caltech-IPAC:main Oct 2, 2025
8 checks passed
github-actions bot pushed a commit that referenced this pull request Oct 3, 2025
IRSA-7104: Add "Access IRSA HATS using lsdb" notebook d4cfd94
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

content Content related issues/PRs.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants