Merged
Changes from 3 commits
8 changes: 4 additions & 4 deletions tutorials/cloud_access/cloud-access-intro.md
@@ -27,8 +27,6 @@ Learning Goals:

## 1. Cloud basics

-### 1.1 Terminology
-
AWS S3 is an [object store](https://en.wikipedia.org/wiki/Object_storage) where the fundamental entities are "buckets" and "objects".
Buckets are containers for objects, and objects are blobs of data.
Users may be more familiar with [filesystem](https://en.wikipedia.org/wiki/File_system) "files" and "directories".
@@ -43,15 +41,17 @@ The following S3 terms are also used in this notebook:

+++

-### 1.2 General access
+### 1.1 General access

Most of the common python methods used to read images and catalogs from a local disk can also be pointed at cloud storage buckets.
This includes methods like Astropy `fits.open` and Pandas `read_parquet`.
The cloud connection is handled by a separate library, usually [s3fs](https://s3fs.readthedocs.io), [fsspec](https://filesystem-spec.readthedocs.io), or [pyarrow.fs](https://arrow.apache.org/docs/python/api/filesystems.html).

The IRSA buckets are public and access is free.
Credentials are not required.
-Anonymous connections can be made, often by setting a keyword argument like `anon=True`.
+However, most tools will look for credentials by default and raise an error when none are found.
+To avoid this, users can make an "anonymous" connection, usually with a keyword argument such as `anon=True`.
+This notebook demonstrates with the `s3fs`, `astropy`, and `pyarrow` libraries.

+++

12 changes: 6 additions & 6 deletions tutorials/parquet-catalog-demos/wise-allwise-catalog-demo.md
@@ -31,7 +31,8 @@ kernelspec:
## Introduction

This notebook demonstrates access to the [HEALPix](https://ui.adsabs.harvard.edu/abs/2005ApJ...622..759G/abstract)-partitioned (order 5), [Apache Parquet](https://parquet.apache.org/) version of the [AllWISE Source Catalog](https://wise2.ipac.caltech.edu/docs/release/allwise/expsup/sec1_3.html#src_cat).
-The catalog is available through the [AWS Open Data](https://aws.amazon.com/opendata) program, as part of the [NASA Open-Source Science Initiative](https://science.nasa.gov/open-science-overview).
+The catalog is available through the [AWS Open Data](https://registry.opendata.aws/wise-allwise/) program, as part of the [NASA Open-Source Science Initiative](https://science.nasa.gov/open-science-overview).
+Access is free and no special permissions or credentials are required.

Parquet is convenient for large astronomical catalogs in part because the storage format supports efficient database-style queries on the files themselves, without having to load the catalog into a database (or into memory) first.
The AllWISE catalog is fairly large at 340 GB.
@@ -65,13 +66,12 @@ from pyarrow.fs import S3FileSystem

+++

-This AllWISE catalog is stored in an [AWS S3](https://aws.amazon.com/s3/) bucket.
-To connect to an S3 bucket we just need to point the reader at S3 instead of the local filesystem, and pass in AWS credentials.
+This AllWISE catalog is stored in an [AWS S3](https://aws.amazon.com/s3/) cloud storage bucket.
+To connect to an S3 bucket we just need to point the reader at S3 instead of the local filesystem.
(Here, a "reader" is a python library that reads parquet files.)
We'll use [pyarrow.fs.S3FileSystem](https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html) for this because it is recognized by every reader in examples below, and we're already using pyarrow.
-[s3fs](https://s3fs.readthedocs.io/en/latest/index.html) is another common option.
Contributor:
why remove this line?

Contributor Author:
Because it's covered in the cloud access notebook that I added a link to and I felt that it interrupted the flow of this paragraph much more than it added. Do you think it's particularly valuable here?

Contributor:
not critical at all, but seemed small enough and useful enough to leave here, since many users won't follow the link.

-The call to `S3FileSystem` will look for AWS credentials in environment variables and/or the file ~/.aws/credentials.
-Credentials can also be passed as keyword arguments.
+To access without credentials, we'll use the keyword argument `anonymous=True`.
Contributor:
Suggested change:
-To access without credentials, we'll use the keyword argument `anonymous=True`.
+To avoid an error about credentials, we'll use the keyword argument `anonymous=True`.

Contributor Author:
What's the motivation for saying the argument helps avoid an error? The same could be said for most arguments. Rereading what I changed above, I see that this mirrors the "To avoid this" that I put in the other notebook. It makes it sound like the keyword argument is a workaround for some problem, but a lack of credentials isn't a problem. Maybe I'll change that a bit to phrase it more positively.

Contributor:
not super important, I just thought we should justify the keyword argument as being associated with not needing credentials, and I assume people won't read all the text, so repetition seems appropriate.

+More information about accessing S3 buckets can be found at [](#cloud-access-intro).

```{code-cell} ipython3
bucket = "nasa-irsa-wise"