Commit 8437eb0

Merge pull request #260 from s22s/docs/fix-250-89

Docs fixes in raster read

2 parents edee965 + 8da23d1

File tree: 4 files changed, +50 -22 lines


pyrasterframes/src/main/python/docs/raster-read.pymd

Lines changed: 47 additions & 20 deletions
@@ -8,11 +8,13 @@ from pyrasterframes.rasterfunctions import *
 spark = create_rf_spark_session()
 ```
 
-RasterFrames registers a DataSource named `raster` that enables reading of GeoTIFFs (and other formats when @ref:[GDAL is installed](getting-started.md#installing-gdal)) from arbitrary URIs. In the examples that follow we'll be reading from a Sentinel-2 scene stored in an AWS S3 bucket.
+RasterFrames registers a DataSource named `raster` that enables reading of GeoTIFFs (and other formats when @ref:[GDAL is installed](getting-started.md#installing-gdal)) from arbitrary URIs. The `raster` DataSource operates on either a single raster file location or another DataFrame, called a _catalog_, containing pointers to many raster file locations.
+
+RasterFrames can also read from @ref:[GeoTrellis catalogs and layers](raster-read.md#geotrellis).
 
 ## Single Raster
 
-The simplest form is reading a single raster from a single URI.
+The simplest way to use the `raster` reader is with a single raster from a single URI or file. In the examples that follow we'll be reading from a Sentinel-2 scene stored in an AWS S3 bucket.
 
 ```python read_one_uri
 rf = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif')
@@ -56,19 +58,19 @@ RasterFrames relies on three different IO drivers, selected based on a combinati
 
 | Prefix              | GDAL        | Java I/O | Hadoop   |
 | ------------------- | ----------- | -------- | -------- |
-| `gdal://<vsidrv>//` | +           | -        | -        |
-| `file://`           | +           | +        | -        |
-| `http://`           | +           | +        | -        |
-| `https://`          | +           | +        | -        |
-| `ftp://`            | `/vsicurl/` | +        | -        |
-| `hdfs://`           | `/vsihdfs/` | -        | +        |
-| `s3://`             | `/vsis3/`   | +        | -        |
-| `s3n://`            | -           | -        | +        |
-| `s3a://`            | -           | -        | +        |
-| `wasb://`           | `/vsiaz/`   | -        | +        |
-| `wasbs://`          | -           | -        | +        |
+| `gdal://<vsidrv>//` | yes         | no       | no       |
+| `file://`           | yes         | yes      | no       |
+| `http://`           | yes         | yes      | no       |
+| `https://`          | yes         | yes      | no       |
+| `ftp://`            | `/vsicurl/` | yes      | no       |
+| `hdfs://`           | `/vsihdfs/` | no       | yes      |
+| `s3://`             | `/vsis3/`   | yes      | no       |
+| `s3n://`            | no          | no       | yes      |
+| `s3a://`            | no          | no       | yes      |
+| `wasb://`           | `/vsiaz/`   | no       | yes      |
+| `wasbs://`          | no          | no       | yes      |
 
-Specific [GDAL Virtual File System drivers](https://gdal.org/user/virtual_file_systems.html) can be selected using the `gdal://<vsidrv>//` syntax. For example If you have a `archive.zip` file containing a GeoTiff named `my-file-inside.tif`, you can address it with `gdal://vsizip//path/to/archive.zip/my-file-inside.tif`. See the GDAL documentation for the format of the URIs after the `gdal:/` prefix (which is stripped off before passing the rest of the path to GDAL).
+Specific [GDAL Virtual File System drivers](https://gdal.org/user/virtual_file_systems.html) can be selected using the `gdal://<vsidrv>//` syntax. For example, if you have an `archive.zip` file containing a GeoTiff named `my-file-inside.tif`, you can address it with `gdal://vsizip//path/to/archive.zip/my-file-inside.tif`. Another example would be an MRF file in an S3 bucket on AWS: `gdal://vsis3/my-bucket/prefix/to/raster.mrf`. See the GDAL documentation for the format of the URIs after the `gdal:/` scheme, which is stripped off before passing the rest of the path to GDAL.
 
 
 ## Raster Catalogs
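As a minimal sketch of the `gdal://<vsidrv>//` addressing described in the hunk above: the archive and bucket names are hypothetical, and `vsi_driver` is an illustrative helper, not part of RasterFrames.

```python
# Hypothetical URIs following the gdal://<vsidrv>// syntax documented above.
zip_uri = "gdal://vsizip//path/to/archive.zip/my-file-inside.tif"
mrf_uri = "gdal://vsis3/my-bucket/prefix/to/raster.mrf"

def vsi_driver(uri):
    # The <vsidrv> component is the first path segment after the gdal:// scheme.
    return uri[len("gdal://"):].split("/")[0]

print(vsi_driver(zip_uri))  # vsizip
print(vsi_driver(mrf_uri))  # vsis3
```

With a RasterFrames-enabled SparkSession, either URI would be passed directly to `spark.read.raster(...)`.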
@@ -127,11 +129,11 @@ Observe that the schema of the resulting DataFrame has a projected raster struct
 rf.select('gid', rf_extent('red'), rf_extent('nir'), rf_tile('red'), rf_tile('nir')).show(3, False)
 ```
 
-### Lazy Raster Reads
+## Lazy Raster Reads
 
-By default the raster reads are delayed as long as possible. The DataFrame will contain metadata and pointers to the appropriate portion of the data until
+By default the raster reads are delayed as long as possible. The DataFrame will contain metadata and pointers to the appropriate portion of the data until reading of the source raster data is absolutely necessary. This can save a lot of computation and I/O time for two reasons. First, a _catalog_ may contain millions of rows. Second, the `raster` DataSource attempts to ensure filters are processed before reading raster data.
 
-Consider the following two reads of the same data source. In the first, the lazy case, there is a pointer to the URI, extent and band to read. This will not be evaluated until the cell values are absolutely required. The second case shows the realized tile is queried right away.
+Consider the following two reads of the same data source. In the first, the lazy case, there is a pointer to the URI, extent, and band to read. This will not be evaluated until the cell values are absolutely required. The second case shows the option to force the raster to be fully read right away.
 
 ```python lazy_demo
 uri = 'https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif'
@@ -142,15 +144,17 @@ spark.read.raster(uri, lazy_tiles=False) \
 .select('proj_raster.tile').show(1, False)
 ```
 
-In the initial examples on this page, we used @ref:[`rf_tile`](reference.md#rf-tile) to explicitly request the realized tile from the lazy representation.
+In the initial examples on this page, you may have noticed that the realized (non-lazy) tiles are shown, but we did not change `lazy_tiles`. Instead, we used @ref:[`rf_tile`](reference.md#rf-tile) to explicitly request the realized tile from the lazy representation.
 
 ## Multiband Rasters
 
 A multiband raster represents a three dimensional numeric array. The first two dimensions are spatial, and the third dimension is typically designated as different @ref:[bands](concepts.md#band). The bands could represent intensity of different wavelengths of light (or other electromagnetic radiation), or they could represent other phenomena such as measurement time, quality indications, or additional measurements.
 
-When reading a multiband raster or a _Catalog_ describing multiband rasters, you will need to know ahead of time which bands you want to read. You will specify the bands to read, indexed from zero, passing a list of integers into the `band_indexes` parameter of the `raster` reader.
+Multiband raster files have a strictly ordered set of bands, which are typically indexed from 1. Some files have metadata tags associated with each band, such as a color interpretation tag indicating how the bands should be interpreted.
+
+When reading a multiband raster or a _catalog_ describing multiband rasters, you will need to know ahead of time which bands you want to read. You will specify the bands to read, **indexed from zero**, as a list of integers passed to the `band_indexes` parameter of the `raster` reader.
 
-For example we can read a four-band (red, green, blue, and near-infrared) image as follows. The individual rows of the resulting DataFrame still represent distinct spatial extents, with a projected raster column for each band specified by `band_indexes`.
+For example, we can read a four-band (red, green, blue, and near-infrared) image as follows. The individual rows of the resulting DataFrame still represent distinct spatial extents, with a projected raster column for each band specified by `band_indexes`.
 
 ```python Multiband
 mb = spark.read.raster('s3://s22s-test-geotiffs/naip/m_3807863_nw_17_1_20160620.tif',
@@ -180,6 +184,29 @@ mb2 = spark.read.raster(catalog=spark.createDataFrame(mb_cat),
 mb2.printSchema()
 ```
 
+## GeoTrellis
+
+### GeoTrellis Catalogs
+
+[GeoTrellis][GeoTrellis] is one of the key libraries that RasterFrames builds upon. It provides a Scala language API for working with large raster data with Apache Spark. RasterFrames provides a DataSource that supports both reading and @ref:[writing](raster-write.md#geotrellis-layers) with GeoTrellis.
+
+A GeoTrellis catalog is a set of GeoTrellis layers. We can read a DataFrame describing the contents of a catalog as follows. The scheme is typically `hdfs` or a cloud storage provider like `s3` or `wasb`.
+
+```python, evaluate=False
+gt_cat = spark.read.geotrellis_catalog('scheme://path-to-gt-catalog')
+```
+
+### GeoTrellis Layers
+
+The catalog will give details on the particular layers available for query. We can read a layer given the same URI as the catalog, the layer name, and the desired zoom level.
+
+```python, evaluate=False
+gt_layer = spark.read.geotrellis(path='scheme://path-to-gt-catalog', layer=layer_name, zoom=zoom_level)
+```
+
+This will return a RasterFrame with additional metadata inherited from the GeoTrellis `TileLayerMetadata`, such as the `SpatialKey`. The `TileLayerMetadata` is also stored as JSON in the metadata of the tile column.
+
 [CRS]: concepts.md#coordinate-reference-system--crs
 [Extent]: concepts.md#extent
 [Tile]: concepts.md#tile
+[GeoTrellis]: https://geotrellis.readthedocs.io/en/latest/
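A small sketch of the band-numbering convention noted in the multiband hunk above: raster files order bands from 1, while the `band_indexes` parameter is zero-based. The band names here are an assumption for illustration.

```python
# Hypothetical band layout for a four-band NAIP-style image; names assumed.
file_bands = {1: "red", 2: "green", 3: "blue", 4: "nir"}  # files index bands from 1

band_indexes = [0, 1, 2, 3]  # the `raster` reader's parameter is zero-based

# Each zero-based index i selects file band i + 1.
selected = [file_bands[i + 1] for i in band_indexes]
print(selected)  # ['red', 'green', 'blue', 'nir']
```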

pyrasterframes/src/main/python/docs/raster-write.pymd

Lines changed: 1 addition & 1 deletion
@@ -89,7 +89,7 @@ os.remove(outfile)
 
 ## GeoTrellis Layers
 
-[GeoTrellis][GeoTrellis] is one of the key libraries that RasterFrames builds upon. It provides a Scala language API to working with large raster data with Apache Spark. Ingesting raster data into a Layer is one of the key concepts for creating a dataset for processing on Spark. RasterFrames write data from an appropriate DataFrame into a [GeoTrellis Layer](https://geotrellis.readthedocs.io/en/latest/guide/tile-backends.html). RasterFrames provides a `geotrellis` DataSource that supports both reading and writing of GeoTrellis layers.
+[GeoTrellis][GeoTrellis] is one of the key libraries that RasterFrames builds upon. It provides a Scala language API for working with large raster data with Apache Spark. Ingesting raster data into a Layer is one of the key concepts for creating a dataset for processing on Spark. RasterFrames writes data from an appropriate DataFrame into a [GeoTrellis Layer](https://geotrellis.readthedocs.io/en/latest/guide/tile-backends.html). RasterFrames provides a `geotrellis` DataSource that supports both @ref:[reading](raster-read.md#geotrellis) and writing of GeoTrellis layers.
 
 > An example is forthcoming.

pyrasterframes/src/main/python/docs/reference.pymd

Lines changed: 1 addition & 1 deletion
@@ -76,7 +76,7 @@ Get the cell type of the `tile`. The cell type can be changed with @ref:[rf_conv
 
 Tile rf_tile(ProjectedRasterTile proj_raster)
 
-Get the `tile` from the `ProjectedRasterTile` or `RasterSource` type tile column.
+Get the fully realized (non-lazy) `tile` from the `ProjectedRasterTile` or `RasterSource` type tile column.
 
 ### rf_extent
pyrasterframes/src/main/python/pyrasterframes/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -223,4 +223,5 @@ def set_dims(parts):
 DataFrameReader.geotiff = lambda df_reader, path: _layer_reader(df_reader, "geotiff", path)
 DataFrameWriter.geotiff = _geotiff_writer
 DataFrameReader.geotrellis = lambda df_reader, path: _layer_reader(df_reader, "geotrellis", path)
+DataFrameReader.geotrellis_catalog = lambda df_reader, path: _aliased_reader(df_reader, "geotrellis-catalog", path)
 DataFrameWriter.geotrellis = lambda df_writer, path: _aliased_writer(df_writer, "geotrellis", path)
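The one-line registration added above follows the same monkey-patching pattern as the surrounding lines: attach a reader method to pyspark's `DataFrameReader` class. A minimal stand-in sketch of that pattern, where the class and helper below are illustrative stand-ins, not the real pyspark or pyrasterframes code:

```python
class FakeDataFrameReader:
    """Stand-in for pyspark.sql.DataFrameReader."""
    pass

def _aliased_reader(df_reader, format_key, path):
    # Stand-in for the real helper: just report which format/path would be read.
    return (format_key, path)

# Attach the method to the class, mirroring the registration in __init__.py.
FakeDataFrameReader.geotrellis_catalog = (
    lambda df_reader, path: _aliased_reader(df_reader, "geotrellis-catalog", path)
)

reader = FakeDataFrameReader()
print(reader.geotrellis_catalog("s3://bucket/path-to-gt-catalog"))
# ('geotrellis-catalog', 's3://bucket/path-to-gt-catalog')
```

Because the lambda is assigned to the class, every `FakeDataFrameReader` instance gains the method, which is how `spark.read.geotrellis_catalog(...)` becomes available after importing pyrasterframes.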
