
Commit 57b4644

Edits: raster-io.md, raster-catalogs.pymd, raster-read.pymd.
1 parent 7302877 commit 57b4644

File tree

4 files changed: +48 -38 lines changed


pyrasterframes/src/main/python/docs/raster-catalogs.pymd

Lines changed: 20 additions & 15 deletions
@@ -1,15 +1,15 @@
 # Raster Catalogs
 
-While much interesting processing can be done on a @ref:[single raster file](raster-read.md#single-raster), RasterFrames shines when _catalogs_ of raster data are to be processed. In its simplest form, a _catalog_ is a list of @ref:[URLs referencing raster files](raster-read.md#uri-formats). This list can be a Spark DataFrame, Pandas DataFrame, CSV file or CSV string. The _catalog_ is input into the `raster` DataSource, described in the @ref:[next page](raster-read.md), which creates _tiles_ from the rasters at the referenced URLs.
+While interesting processing can be done on a @ref:[single raster file](raster-read.md#single-raster), RasterFrames shines when _catalogs_ of raster data are to be processed. In its simplest form, a _catalog_ is a list of @ref:[URLs referencing raster files](raster-read.md#uri-formats). This list can be a Spark DataFrame, Pandas DataFrame, CSV file or CSV string. The _catalog_ is input into the `raster` DataSource described in the @ref:[next page](raster-read.md), which creates _tiles_ from the rasters at the referenced URLs.
 
 A _catalog_ can have one or two dimensions:
 
 * One-D: A single column contains raster URLs across the rows. All referenced rasters represent the same @ref:[band](concepts.md#band). For example, a column of URLs to Landsat 8 near-infrared rasters covering Europe. Each row represents different places and times.
-* Two-D: Many columns containing raster URLs. Each column references the same band, and each row represents the same place and time. For example, red-, green-, and blue-band columns for scenes covering Europe. Each row represents a single @ref:[scene](concepts.md#scene) with the same resolution, extent, [_CRS_][CRS], etc across the row.
+* Two-D: Many columns contain raster URLs. Each column references the same band, and each row represents the same place and time. For example, red-, green-, and blue-band columns for scenes covering Europe. Each row represents a single @ref:[scene](concepts.md#scene) with the same resolution, extent, [_CRS_][CRS], etc. across the row.
 
 ## Creating a Catalog
 
-This section will provide some examples of creating your own _catalogs_, as well as introduce some experimental _catalogs_ built into RasterFrames. Reading raster data represented by a _catalog_ is covered in more detail in the @ref:[next page](raster-read.md).
+This section provides some examples of creating _catalogs_, as well as an introduction to some experimental _catalogs_ built into RasterFrames. Reading raster data represented by a _catalog_ is covered in more detail in the @ref:[next page](raster-read.md).
 
 ```python, setup, echo=False
 from pyrasterframes.utils import create_rf_spark_session
@@ -24,13 +24,12 @@ spark = create_rf_spark_session()
 A single URL is the simplest form of a catalog.
 
 ```python, oned_onerow_catalog
-from pyspark.sql import Row
-
 file_uri = "/data/raster/myfile.tif"
 # Pandas DF
 my_cat = pd.DataFrame({'B01': [file_uri]})
 
 # equivalent Spark DF
+from pyspark.sql import Row
 my_cat = spark.createDataFrame([Row(B01=file_uri)])
 
 #equivalent CSV string
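As an aside on the snippet above: the CSV form of a one-column, one-row catalog is nothing more than a header line followed by the URI. A minimal dependency-free sketch (editor's illustration in plain Python, no Spark or pandas required; `file_uri` as in the example):

```python
file_uri = "/data/raster/myfile.tif"

# The one-D, one-row catalog as a CSV string:
# a 'B01' header line, then one raster URI per subsequent row.
my_cat_csv = '\n'.join(['B01', file_uri])

print(my_cat_csv)
# B01
# /data/raster/myfile.tif
```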
@@ -55,27 +54,33 @@ one_d_cat = '\n'.join(['B01', scene1_B01, scene2_B01])
 
 ### Two-D
 
-Example of a multiple columns representing multiple content types (bands) across multiple scenes. In each row, the scene is the same: granule id `h04v09` on July 4 or July 7, 2018. The first column is band 1, red, and the second is band 2, near infrared.
+In this example, multiple columns represent multiple content types (bands) across multiple scenes. In each row, the scene is the same: granule id `h04v09` on July 4 or July 7, 2018. The first column is band 1, red, and the second is band 2, near infrared.
 
 ```python, twod_catalog
 scene1_B01 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B01.TIF"
 scene1_B02 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B02.TIF"
 scene2_B01 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B01.TIF"
 scene2_B02 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B02.TIF"
 
+# Pandas DF
+my_cat = pd.DataFrame([
+    {'B01': scene1_B01, 'B02': scene1_B02},
+    {'B01': scene2_B01, 'B02': scene2_B02}
+])
 
-# As CSV string
-my_cat = '\n'.join(['B01,B02', scene1_B01 + "," + scene1_B02, scene2_B01 + "," + scene2_B02])
 # or
 my_cat_df = spark.createDataFrame([
     Row(B01=scene1_B01, B02=scene1_B02),
-    Row(B01=scene2_B01, B02=scene2_B02)])
-my_cat_df.printSchema()
+    Row(B01=scene2_B01, B02=scene2_B02)
+])
+
+# As CSV string
+my_cat = '\n'.join(['B01,B02', scene1_B01 + "," + scene1_B02, scene2_B01 + "," + scene2_B02])
 ```
 
 ## Using External Catalogs
 
-The concept of a _catalog_ is much more powerful when we consider examples beyond constructing the DataFrame, and instead read the data from an external source. Here's an extended example of reading an cloud-hosted CSV file containing MODIS scene metadata and transforming it into a _catalog_. The metadata describing the content of each URL is an important aspect of processing raster data.
+The concept of a _catalog_ is much more powerful when we consider examples beyond constructing the DataFrame, and instead read the data from an external source. Here's an extended example of reading a cloud-hosted CSV file containing MODIS scene metadata and transforming it into a _catalog_. The metadata describing the content of each URL is an important aspect of processing raster data.
 
 ```python, remote_csv, results='raw'
 from pyspark import SparkFiles
@@ -103,17 +108,17 @@ modis_catalog = scene_list \
 modis_catalog.show(4, truncate=True)
 ```
 
-## Using Built-in Experimental Catalogs
+## Using Built-in Catalogs
 
 RasterFrames comes with two experimental catalogs over the AWS PDS [Landsat 8][Landsat] and [MODIS][MODIS] repositories. They are created by downloading the latest scene lists and building up the appropriate band URI columns as in the prior example.
 
-> Note: The first time you run these may take some time, as the catalogs are large. However, they are cached and subsequent invocations should be faster.
+> Note: The first time you run these may take some time, as the catalogs are large and have to be downloaded. However, they are cached and subsequent invocations should be faster.
 
 ### MODIS
 
 ```python, evaluate=False
-modis_catalog2 = spark.read.format('aws-pds-modis-catalog').load()
-modis_catalog2.printSchema()
+modis_catalog = spark.read.format('aws-pds-modis-catalog').load()
+modis_catalog.printSchema()
 ```
 ```
 root
pyrasterframes/src/main/python/docs/raster-io.md

Lines changed: 3 additions & 1 deletion
@@ -11,11 +11,13 @@ The standard mechanism by which any data is brought in and out of a Spark Datafr
 - `geotiff`: a simplified reader for reading a single GeoTIFF file
 - `geotrellis`: for reading a [GeoTrellis layer][GTLayer]
 * @ref:[Raster Writers](raster-write.md)
-  - You can write @ref:[Tile](raster-write.md#tile-samples) and @ref:[DataFrame](raster-write.md#dataframe-samples) samples
   - @ref:[`geotiff`](raster-write.md#geotiffs): beta writer to GeoTiff file format
   - @ref:[`geotrellis`](raster-write.md#geotrellis-layers): creating a [GeoTrellis layer][GTLayer]
   - @ref:[`parquet`](raster-write.md#parquet): general purpose writer for [Parquet][Parquet]
 
+
+Furthermore, when in a Jupyter Notebook environment, you can view @ref:[Tile](raster-write.md#tile-samples) and @ref:[DataFrame](raster-write.md#dataframe-samples) samples.
+
 There is also support for @ref:[vector data](vector-data.md) for masking and data labeling.
 
 @@@ index
