IRSA-7104: Add "Access IRSA HATS using lsdb" notebook #136
Conversation
troyraen
left a comment
I'll come back and look in more detail, but here are a couple of things I noticed quickly.
Will you make the markdown text one line per sentence? We try to stick with that in these repos because it makes the reviews a lot easier.
@jaladh-singhal - could you roll back your last commit and break the Markdown at the ends of sentences? Breaking lines at a fixed character length will produce much bigger diffs than necessary in the future, which is why we break these markdown files at sentence ends. (And a good rule of thumb is to write shorter sentences if a line becomes really long :) )
jkrick
left a comment
As I said in slack, I think this is a super useful notebook, so thanks for writing it. I have a few comments below, but traditionally this repo doesn't get quite the scrutiny that fornax-demo-notebooks does, so please take these as suggestions.
I think this title does work, but to be nitpicky, I would like to also see the words "Euclid" and "ZTF" in there to help users wanting those particular datasets. Some suggestions:
"Working with parquet Collections: Euclid Q1 & ZTF DR23 via lsdb" or "HATS-Based Analysis of Euclid and ZTF Catalogs" or "Using lsdb to Filter and Match Euclid and ZTF Cloud Catalogs"
Split section 5.2 into two sections: 5.2 "Execute the query" and 5.3 "Filter the data frame".
My biggest comment is that the notebook feels very long to me. Sections I think could be cut:
- don't need to see the whole euclid_schema_df, so no need to print it out in full in section 3.2
- I think one filtering example on the column names would be enough, so I suggest keeping:
  ```python
  euclid_schema_df[
      euclid_schema_df["name"].str.startswith("phz_")
      & euclid_schema_df["type"].str.contains("int")  # to see flag type columns
  ]
  ```
  and removing:
  ```python
  euclid_schema_df[euclid_schema_df["name"].str.startswith("phz_")]  # phz_ prefix is for PHZ catalog columns in this merged catalog
  ```
- section 4 can be shortened because it is really a repeat of section 3 with "Euclid" replaced by "ZTF", so I would suggest limiting it to just what we need for the rest of the notebook.
- section 5.3: suggest removing the parts about "check if there is any ZTF object that has observations in multiple filters" and going straight to the plots
- I don't think the number of galaxies as a function of redshift is an important plot since much of this information can be gleaned from the hex bin plots above.
And I put these in slack before I knew I was doing a full review, so here are the requested comments on the specific filters, copied here so we can track them:
- In section 5.2, I'm not sure why you keep those sources that are within the same healpix area but outside of the 75th percentile.
- I wonder whether, if you limited to redshifts below 1, you would be able to see trends in the hexbin plots as a function of redshift better. I would probably filter on redshift during the initial filtering with "galaxy_filters".
- What is the magnitude limit of ZTF in r band? Those data points at early times with the larger error bars appear to be pulling up the RMS mag and other metrics, and I wonder if they are good data points or not. For the light curve notebook we wrote a while ago we filtered on "catflags < 32768" to remove bad flags; can we do that in this notebook as well?
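The two filters suggested above can be sketched in pandas. The `catflags < 32768` cut is quoted from the earlier light-curve notebook; the column names and values below are illustrative placeholders, not the notebook's data.

```python
import pandas as pd

# Hypothetical light-curve rows joined with a photo-z column.
lc = pd.DataFrame({
    "mag": [18.2, 19.1, 25.0, 18.7],
    "catflags": [0, 2, 32768, 0],        # ZTF quality flags; >= 32768 marks bad epochs
    "phz_median": [0.3, 1.2, 0.9, 0.6],  # photo-z column name is an assumption
})

# Drop epochs with bad ZTF flags, as in the earlier light-curve notebook.
good = lc[lc["catflags"] < 32768]

# Optionally restrict to z < 1 to sharpen redshift trends in the hexbin plots.
low_z = good[good["phz_median"] < 1.0]
print(len(good), len(low_z))  # → 3 2
```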
I just tried running this on fornax, and am getting the following AWS access error on the first cell in section 3.2 which calls pq.read_schema on the euclid path.

```
OSError                                   Traceback (most recent call last)
File /opt/envs/python3/lib/python3.12/site-packages/pyarrow/parquet/core.py:2389, in read_schema(where, memory_map, decryption_properties, filesystem)
File /opt/envs/python3/lib/python3.12/site-packages/pyarrow/_fs.pyx:814, in pyarrow._fs.FileSystem.open_input_file()
File /opt/envs/python3/lib/python3.12/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()
File /opt/envs/python3/lib/python3.12/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()
OSError: When reading information for key 'contributed/q1/merged_objects/hats/euclid_q1_merged_objects-hats/dataset/_common_metadata' in bucket 'nasa-irsa-euclid-q1': AWS Error ACCESS_DENIED during HeadObject operation: No response body.
```
@jkrick good catch. I also ran it at Fornax and hit this problem. Will push a fix to the PR soon so that it works at Fornax as well as here in CI.
@jkrick you can try it now - I pushed an update and tested it on Fornax myself, and it works! Note: you might want to increase the cone radius, adjust plot X/Y bounds, etc. on Fornax.
@jkrick - I like to think of it the other way around: I'm more certain that these notebooks will work out of the box. But I see what you may mean about the content -- and for that I think we should be very open to incoming improvements, or better document that these are showcases meant to focus on the data rather than the scientific process. If you have a good description that makes this approach clear, I think we should add it to the index page.
Just a heads-up: I'm kind of hijacking this PR to do some caching experiments, too, so ignore the CircleCI statuses and/or any strange CircleCI errors you may see.
This one needs a careful rebase now that we've switched the system. Let me know if you'd rather have me do it.
@bsipocz sure go for it! |
OK, so a question: which new subject category should this go into?
Right now it's in "Special Topics" -- I feel that shouldn't "other" the cloud and parquet/HATS stuff, but it probably also shouldn't hide underneath one of the surveys. Maybe "Big data astronomy"?
Closed and reopened to trigger the GHA docs build, too. I do that to make sure we don't run out of resources for deployment (that is not an issue on CircleCI, where a bunch of other notebooks are excluded, but I noticed that this one spiked up to 70%).
bsipocz
left a comment
Rendering looks good to me, and testing does too, so I would say let's go ahead and merge this.
And I have added the GHA html build job, too, to ensure this doesn't make that build run out of resources. IMO we should add that extra job whenever we add a new notebook or make significant (performance-affecting) changes.
@bsipocz - I still need to address some of the science feedback I got from Jessica, Harry/Anhita so I'll let you know when it's ready. Thanks for the infra updates and making it ready! |
That's all good; I don't need to sign off on this again unless you feel you've made significant infra choices or want me to double-check something.
bsipocz
left a comment
@jaladh-singhal - I keep seeing this error in rendering: https://github.com/Caltech-IPAC/irsa-tutorials/actions/runs/18145382851/job/51645786712?pr=136#step:5:291
It's really puzzling that it doesn't show up as a test failure with pytest, could you have a look at it though?
In the meantime I'm blocking this PR from merging as we need to understand what is going on.
@bsipocz I'll take a look - how is the "buildhtml testing" job different from the pytest ones and the CircleCI rendering? I mean in terms of its purpose, because the other jobs execute the "id search" just fine.
It's a known anti-feature that it doesn't report a failing status when there is a failing notebook in it: if you look into the build log, the traceback is there.
The TypeError was happening because the lsdb->hats->pyarrow machinery was picking up the newest universal_pathlib v0.3, released yesterday, which changed the object/type. Pinning it for now until they resolve the typing mismatch (opened an issue upstream: astronomy-commons/lsdb#1047). @bsipocz the buildhtml job now passes for my notebook; it's failing for an unrelated reason (some DOI is not found).
I'm in the middle of reviewing this. Please don't merge yet. |
The DOIs are not expected to fail the build, and I'm dealing with making sure tracebacks will fail the job in a separate PR.
bsipocz
left a comment
One more comment, then this is good to go from my end.
troyraen
left a comment
Thanks @jaladh-singhal!
Co-authored-by: Brigitta Sipőcz <[email protected]>
Co-authored-by: Troy Raen <[email protected]>
Thanks @jaladh-singhal this looks great! And I just ran it on Fornax and it all went well. Merging now. |
IRSA-7104: Add "Access IRSA HATS using lsdb" notebook d4cfd94
Addresses IRSA-7104
NOTE: The time performance of the notebook depends on the spatial filters defined at the beginning. I'm working with a cone at the Euclid Deep Field North center, with the following radius size:
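Since runtime depends on the cone filter, a useful intuition is that (for small radii, flat-sky approximation) the sky area, and hence roughly the data volume to load, grows with the square of the radius. A quick sketch with illustrative values, not the notebook's actual setting:

```python
import math

def cone_area_deg2(radius_arcmin: float) -> float:
    """Approximate sky area of a small cone in square degrees (flat-sky)."""
    r_deg = radius_arcmin / 60.0
    return math.pi * r_deg ** 2

# Doubling the cone radius roughly quadruples the area (and data volume).
for r in (5, 10, 20):  # arcmin; illustrative values only
    print(r, round(cone_area_deg2(r), 4))
```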