Commit 0da9aa9

Merge pull request #540 from s22s/docs/write-geotiff-in-ML-examples
update docs on geotiff writing
2 parents 1c1995f + b903103 commit 0da9aa9

File tree

2 files changed: +43 −28 lines changed

pyrasterframes/src/main/python/docs/raster-write.pymd

Lines changed: 22 additions & 9 deletions
@@ -53,11 +53,29 @@ ov
 ## GeoTIFFs
 
-GeoTIFF is one of the most common file formats for spatial data, providing flexibility in data encoding, representation, and storage. RasterFrames provides a specialized Spark DataFrame writer for rendering a RasterFrame to a GeoTIFF.
+GeoTIFF is one of the most common file formats for spatial data, providing flexibility in data encoding, representation, and storage. RasterFrames provides a specialized Spark DataFrame writer for rendering a RasterFrame to a GeoTIFF. It is accessed by calling `dataframe.write.geotiff`.
 
-One downside to GeoTIFF is that it is <b><u>not</u></b> a big-data native format. To create a GeoTIFF, all the data to be encoded has to be in the memory of one computer (in Spark parlance, this is a "collect"), limiting its maximum size substantially compared to that of a full cluster environment. When rendering GeoTIFFs in RasterFrames, you must either specify the dimensions of the output raster, or deliberately limit the size of the collected data.
+### Limitations and mitigations
 
-Fortunately, we can use the cluster computing capability to downsample the data into a more manageable size. For sake of example, let's render an overview of a scene's red band as a small raster, reprojecting it to latitude and longitude coordinates on the [WGS84](https://en.wikipedia.org/wiki/World_Geodetic_System) reference ellipsoid (aka [EPSG:4326](https://spatialreference.org/ref/epsg/4326/)).
+One downside to GeoTIFF is that it is <b><u>not</u></b> a big-data native format. To create a GeoTIFF, all the data to be written must be `collect`ed in the memory of the Spark driver. This means you must actively limit the size of the data to be written: it is easy to lazily read a set of inputs that cannot feasibly be written to a GeoTIFF in the same environment.
+
+When writing GeoTIFFs in RasterFrames, you should limit the size of the collected data. Consider filtering the dataframe by time or @ref:[spatial filters](vector-data.md#geomesa-functions-and-spatial-relations).
+
+You can also specify the dimensions of the GeoTIFF file to be written using the `raster_dimensions` parameter as described below.
+
+### Parameters
+
+If there are many _tile_ or projected raster columns in the DataFrame, the GeoTIFF writer will write each one as a separate band in the file. Each band in the output will be tagged with the input column names for reference.
+
+* `path`: the path, local to the driver, where the file will be written
+* `crs`: the PROJ4 string of the CRS the GeoTIFF is to be written in
+* `raster_dimensions`: optional; a tuple of two ints giving the size of the resulting file. If specified, RasterFrames will downsample the data in a distributed fashion using bilinear resampling. If not specified, the default is to write the dataframe at full resolution, which can result in an `OutOfMemoryError`.
+
+### Example
+
+See also the example in the @ref:[unsupervised learning page](unsupervised-learning.md).
+
+Let's render an overview of a scene's red band as a small raster, reprojecting it to latitude and longitude coordinates on the [WGS84](https://en.wikipedia.org/wiki/World_Geodetic_System) reference ellipsoid (aka [EPSG:4326](https://spatialreference.org/ref/epsg/4326/)).
 
 ```python write_geotiff
 outfile = os.path.join('/tmp', 'geotiff-overview.tif')
@@ -78,14 +96,9 @@ with rasterio.open(outfile) as src:
              histtype='stepfilled', title="Overview Histogram")
 ```
 
-If there are many _tile_ or projected raster columns in the DataFrame, the GeoTIFF writer will write each one as a separate band in the file. Each band in the output will be tagged the input column names for reference.
-
-@@@ note
-If no `raster_dimensions` column is specified the DataFrame contents are written at full resolution. As shown in the example above, you can also specify the size of the output GeoTIFF. Bilinear resampling is used.
-@@@
 
 @@@ warning
-Attempting to write a full resolution GeoTIFF constructed from multiple scenes is likely to result in an out of memory error. The `raster_dimensions` parameter needs to be used in these cases.
+Attempting to write a full resolution GeoTIFF constructed from multiple scenes is likely to result in an out of memory error. Consider filtering the dataframe more aggressively and using a smaller value for the `raster_dimensions` parameter.
 @@@
 
 ### Color Composites
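To make the `raster_dimensions` guidance above concrete, here is a small, hypothetical helper (not part of RasterFrames) that computes an output size capped at a maximum dimension while preserving the source aspect ratio. The scene dimensions and the commented `write.geotiff` call are illustrative only, echoing the parameters documented in the diff above.

```python
def overview_dimensions(src_cols, src_rows, max_dim=1024):
    """Cap an output raster at `max_dim` pixels on its longest side,
    preserving the source aspect ratio. Returns a (cols, rows) tuple
    suitable for a `raster_dimensions`-style parameter."""
    scale = min(1.0, max_dim / max(src_cols, src_rows))
    return (max(1, round(src_cols * scale)), max(1, round(src_rows * scale)))

# A nominal 7761 x 7871 scene capped at 1024 pixels on the long side:
dims = overview_dimensions(7761, 7871, 1024)
print(dims)  # → (1010, 1024)

# Hypothetical usage with the writer described above:
# df.write.geotiff('/tmp/overview.tif', crs='EPSG:4326', raster_dimensions=dims)
```

Only the target size is chosen here; per the docs above, RasterFrames performs the actual downsampling in a distributed fashion using bilinear resampling.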

pyrasterframes/src/main/python/docs/unsupervised-learning.pymd

Lines changed: 21 additions & 19 deletions
@@ -30,19 +30,28 @@ from pyspark.ml import Pipeline
 
 The first step is to create a Spark DataFrame of our imagery data. To achieve that we will create a catalog DataFrame using the pattern from [the I/O page](raster-io.html#Single-Scene--Multiple-Bands). In the catalog, each row represents a distinct area and time, and each column is the URI to a band's image product. The resulting Spark DataFrame may have many rows per URI, with a column corresponding to each band.
 
-
 ```python, catalog
 filenamePattern = "https://rasterframes.s3.amazonaws.com/samples/elkton/L8-B{}-Elkton-VA.tiff"
 catalog_df = pd.DataFrame([
     {'b' + str(b): filenamePattern.format(b) for b in range(1, 8)}
 ])
 
-df = spark.read.raster(catalog_df, catalog_col_names=catalog_df.columns)
+tile_size = 256
+df = spark.read.raster(catalog_df, catalog_col_names=catalog_df.columns, tile_size=tile_size)
 df = df.withColumn('crs', rf_crs(df.b1)) \
        .withColumn('extent', rf_extent(df.b1))
 df.printSchema()
 ```
 
+In this small example, all the images in our `catalog_df` have the same @ref:[CRS](concepts.md#coordinate-reference-system-crs-), which we verify in the code snippet below. The `crs` object will be useful for visualization later.
+
+```python, crses
+crses = df.select('crs.crsProj4').distinct().collect()
+print('Found', len(crses), 'distinct CRS:', crses)
+assert len(crses) == 1
+crs = crses[0]['crsProj4']
+```
+
 ## Create ML Pipeline
 
 SparkML requires that each observation be in its own row, and features for each observation be packed into a single `Vector`. For this unsupervised learning problem, we will treat each _pixel_ as an observation and each band as a feature. The first step is to "explode" the _tiles_ into a single row per pixel. In RasterFrames, generally a pixel is called a @ref:[`cell`](concepts.md#cell).
@@ -51,7 +60,7 @@ SparkML requires that each observation be in its own row, and features for each
 exploder = TileExploder()
 ```
 
-To "vectorize" the the band columns, we use the SparkML `VectorAssembler`. Each of the seven bands is a different feature.
+To "vectorize" the band columns, we use the SparkML `VectorAssembler`. Each of the seven bands is a different feature.
 
 ```python, assembler
 assembler = VectorAssembler() \
@@ -111,29 +120,22 @@ We can recreate the tiled data structure using the metadata added by the `TileEx
 ```python, assemble
 from pyrasterframes.rf_types import CellType
 
-tile_dims = df.select(rf_dimensions(df.b1).alias('dims')).first()['dims']
 retiled = clustered.groupBy('extent', 'crs') \
     .agg(
         rf_assemble_tile('column_index', 'row_index', 'prediction',
-                         tile_dims['cols'], tile_dims['rows'], CellType.int8()).alias('prediction')
+                         tile_size, tile_size, CellType.int8())
     )
 ```
 
-The resulting output is shown below.
+Next we will @ref:[write the output to a GeoTIFF file](raster-write.md#geotiffs). Doing so works quickly and well here, for a few specific reasons that may not hold in general. We can write the data at full resolution, omitting the `raster_dimensions` argument, because we know the input raster dimensions are small. Also, the data is all in a single CRS, as we demonstrated above. Because the `catalog_df` has only a single row, we know the output GeoTIFF value at a given location corresponds to a single input. Finally, the `retiled` `DataFrame` has only a single `Tile` column, so the band interpretation is trivial.
 
 ```python, viz
-from pyrasterframes.rf_types import Extent
-aoi = Extent.from_row(
-    retiled.agg(rf_agg_reprojected_extent('extent', 'crs', 'epsg:3857')) \
-        .first()[0]
-)
-
-retiled.select(rf_agg_overview_raster('prediction', 558, 507, aoi, 'extent', 'crs'))
-```
+output_tif = 'unsupervised.tif'
 
+retiled.write.geotiff(output_tif, crs=crs)
 
-```python, viz-true-color, evaluate=False, echo=False
-#For comparison, the true color composite of the original data.
-# this is really dark
-df.select(rf_render_png('b4', 'b3', 'b2'))
-```
+with rasterio.open(output_tif) as src:
+    for b in range(1, src.count + 1):
+        print("Tags on band", b, src.tags(b))
+    show(src)
+```
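The explode/assemble round trip in the diff above can be sketched in plain NumPy: `TileExploder` emits one row per cell carrying `column_index` and `row_index`, and `rf_assemble_tile` inverts that per `(extent, crs)` group. This toy function only illustrates the indexing convention as I understand it; it is not the RasterFrames implementation.

```python
import numpy as np

def assemble_tile(cells, cols, rows):
    """Invert a tile 'explode': place (column_index, row_index, value)
    records back into a 2-D array, analogous to what rf_assemble_tile
    does for each (extent, crs) group."""
    tile = np.zeros((rows, cols), dtype=np.int8)
    for col, row, value in cells:
        tile[row, col] = value
    return tile

# Exploded records for a 2x2 tile of cluster predictions:
cells = [(0, 0, 1), (1, 0, 2), (0, 1, 3), (1, 1, 4)]
tile = assemble_tile(cells, cols=2, rows=2)
print(tile.tolist())  # → [[1, 2], [3, 4]]
```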
