
Commit c5e59b9

Edits: raster-write.pymd & vector-data.pymd.
1 parent 240f878 commit c5e59b9

3 files changed: +18 −16 lines

pyrasterframes/README.md

Lines changed: 1 addition & 1 deletion
@@ -62,7 +62,7 @@ To manually initialize PyRasterFrames in a `pyspark` shell, prepare to call pysp
 
 ```
 
-Then in the pyspark shell or app, import the module and call `withRasterFrames` on the SparkSession.
+Then in the PySpark shell or script, import the module and call `withRasterFrames` on the SparkSession.
 
 ```python
 from pyrasterframes.utils import create_rf_spark_session
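For orientation, the two initialization paths the edited sentence refers to look roughly like this (a minimal sketch; the builder options are illustrative assumptions, not part of the diff):

```python
# Path 1: the convenience helper shown in the surrounding context.
from pyrasterframes.utils import create_rf_spark_session
spark = create_rf_spark_session()

# Path 2: import the module, then call withRasterFrames on the SparkSession,
# as the edited sentence describes.
import pyrasterframes
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master('local[*]') \
    .appName('RasterFrames') \
    .getOrCreate() \
    .withRasterFrames()
```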

pyrasterframes/src/main/python/docs/raster-write.pymd

Lines changed: 12 additions & 10 deletions
@@ -1,8 +1,8 @@
 # Writing Raster Data
 
-RasterFrames is oriented toward large scale analyses of spatial data. The primary output of these analyses could be a @ref:[statistical summary](aggregation.md), a @ref:[machine learning model](machine-learning.md), or some other result that is generally much smaller than the input data set.
+RasterFrames is oriented toward large scale analyses of spatial data. The primary output of these analyses could be a @ref:[statistical summary](aggregation.md), a @ref:[machine learning model](machine-learning.md), or some other result that is generally much smaller than the input dataset.
 
-However, there are times in any analysis where writing a representative sample of the work in progress provides invaluable feedback on the current state of the process and results.
+However, there are times in any analysis where writing a representative sample of the work in progress provides valuable feedback on the current state of the process and results.
 
 ```python imports, echo=False
 import pyrasterframes
@@ -15,7 +15,7 @@ spark = pyrasterframes.get_spark_session()
 
 ## Tile Samples
 
-When collecting a _tile_ (see discussion of the RasterFrame @ref:[schema](raster-read.md#single-raster) for orientation to the concept) to the Python Spark driver, we have some convenience methods to quickly visualize the _tile_.
+We have some convenience methods to quickly visualize _tile_s (see discussion of the RasterFrame @ref:[schema](raster-read.md#single-raster) for orientation to the concept) when inspecting a subset of the data in a Notebook.
 
 In an IPython or Jupyter interpreter, a `Tile` object will be displayed as an image with limited metadata.
 
@@ -37,7 +37,7 @@ display(tile) # IPython.display function
 
 Within an IPython or Jupyter interpreter, a Pandas DataFrame containing a column of _tiles_ will be rendered as the samples discussed above. Simply import the `rf_ipython` submodule to enable enhanced HTML rendering of a Pandas DataFrame.
 
-In the example below, notice the result is limited to a small subset. For more discussion about why this is important, see the @ref:[Pandas and NumPy discussion](numpy-pandas.md).
+In the example below, notice the result is limited to a small subset. For more information about why this is important, see the @ref:[Pandas and NumPy discussion](numpy-pandas.md).
 
 ```python to_pandas, evaluate=True
 import pyrasterframes.rf_ipython
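The `to_pandas` block is truncated by the diff context; the rendering path it describes amounts to something like the following (a sketch; `spark_df`, its `proj_raster` column, and the `limit(3)` sample size are assumed stand-ins):

```python
import pyrasterframes.rf_ipython  # enables HTML rendering of Tile columns in Pandas

# Keep the collected sample small before handing it to Pandas on the driver.
pandas_df = spark_df.select('proj_raster').limit(3).toPandas()
pandas_df  # in Jupyter, each Tile cell renders as a small image
```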
@@ -60,23 +60,25 @@ pandas_df
 
 GeoTIFF is one of the most common file formats for spatial data, providing flexibility in data encoding, representation, and storage. RasterFrames provides a specialized Spark DataFrame writer for rendering a RasterFrame to a GeoTIFF.
 
-One downside to GeoTIFF is that it is not a big data native format. To create a GeoTIFF all the data to be encoded has to be in the memory of one compute node (in Spark parlance, this is a "collect"), limiting it's maximum size substantially compared to that of a full cluster environment. When rendering GeoTIFFs in RasterFrames, you either need to specify the dimensions of the output raster, or be aware of how big the collected data will end up being.
+One downside to GeoTIFF is that it is not a big data native format. To create a GeoTIFF, all the data to be encoded has to be in the memory of one computer (in Spark parlance, this is a "collect"), limiting its maximum size substantially compared to that of a full cluster environment. When rendering GeoTIFFs in RasterFrames, you must either specify the dimensions of the output raster, or deliberately limit the size of the collected data.
 
-Fortunately, we can use the cluster computing capability to downsample the data into a more manageable size. For sake of example, let's render an overview our scene's red band as a small raster, reprojecting it to latitude and longitude coordinates on the [WGS84](https://en.wikipedia.org/wiki/World_Geodetic_System) reference ellipsoid (aka [EPSG:4326](https://spatialreference.org/ref/epsg/4326/)).
+Fortunately, we can use the cluster computing capability to downsample the data into a more manageable size. For the sake of example, let's render an overview of a scene's red band as a small raster, reprojecting it to latitude and longitude coordinates on the [WGS84](https://en.wikipedia.org/wiki/World_Geodetic_System) reference ellipsoid (aka [EPSG:4326](https://spatialreference.org/ref/epsg/4326/)).
 
 ```python write_geotiff
 outfile = os.path.join('/tmp', 'geotiff-overview.tif')
 spark_df.write.geotiff(outfile, crs='EPSG:4326', raster_dimensions=(256, 256))
 ```
 
-View it with `rasterio` to check the results:
+We can view the written file with `rasterio`:
 
 ```python view_geotiff
 import rasterio
 from rasterio.plot import show, show_hist
 
 with rasterio.open(outfile) as src:
+    # View raster
     show(src, adjust='linear')
+    # View data distribution
     show_hist(src, bins=50, lw=0.0, stacked=False, alpha=0.6,
               histtype='stepfilled', title="Overview Histogram")
 ```
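As a quick sanity check on the writer behavior described above, the output can also be inspected through rasterio's metadata attributes (a sketch; the path matches the snippet above, and the attributes are standard rasterio, not part of this diff):

```python
import rasterio

# Confirm the reprojection and downsampling requested of the geotiff writer.
with rasterio.open('/tmp/geotiff-overview.tif') as src:
    print(src.crs)                # expect EPSG:4326
    print(src.width, src.height)  # bounded by raster_dimensions=(256, 256)
    print(src.count)              # number of bands written
```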
@@ -89,13 +91,13 @@ os.remove(outfile)
 
 ## GeoTrellis Layers
 
-[GeoTrellis][GeoTrellis] is one of the key libraries that RasterFrames builds upon. It provides a Scala language API to working with large raster data with Apache Spark. Ingesting raster data into a Layer is one of the key concepts for creating a dataset for processing on Spark. RasterFrames writes data from an appropriate DataFrame into a [GeoTrellis Layer](https://geotrellis.readthedocs.io/en/latest/guide/tile-backends.html). RasterFrames provides a `geotrellis` DataSource that supports both @ref:[reading](raster-read.md#geotrellis) and writing of GeoTrellis layers.
+[GeoTrellis][GeoTrellis] is one of the key libraries upon which RasterFrames is built. It provides a Scala language API for working with geospatial raster data. GeoTrellis defines a [tile layer storage](https://geotrellis.readthedocs.io/en/latest/guide/tile-backends.html) format for persisting imagery mosaics. RasterFrames can write data from a `RasterFrameLayer` into a [GeoTrellis Layer](https://geotrellis.readthedocs.io/en/latest/guide/tile-backends.html). RasterFrames provides a `geotrellis` DataSource that supports both @ref:[reading](raster-read.md#geotrellis-layers) and @ref:[writing](raster-write.md#geotrellis-layers) GeoTrellis layers.
 
-> An example is forthcoming.
+> An example is forthcoming. In the meantime, referencing the [`GeoTrellisDataSourceSpec` test code](https://github.com/locationtech/rasterframes/blob/develop/datasource/src/test/scala/org/locationtech/rasterframes/datasource/geotrellis/GeoTrellisDataSourceSpec.scala) may help.
 
 ## Parquet
 
-You can write the Spark DataFrame to an [Apache Parquet][Parquet] "file". This format is designed to work across different projects in the Hadoop ecosystem. It also provides a variety of optimizations for query against data written in the format.
+You can write a RasterFrame to the [Apache Parquet][Parquet] format. This format is designed to efficiently persist and query columnar data in a distributed file system, such as HDFS. It also provides benefits when working in single node (or "local") mode, such as tailoring data organization for defined query patterns.
 
 ```python write_parquet, evaluate=False
 spark_df.withColumn('exp', rf_expm1('proj_raster')) \
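The `write_parquet` snippet above is cut off by the diff context; a complete round trip along the same lines might look like this (a sketch; the output path is hypothetical):

```python
from pyrasterframes.rasterfunctions import rf_expm1

# Write the DataFrame, with a derived column, to Parquet, then read it back.
spark_df.withColumn('exp', rf_expm1('proj_raster')) \
    .write.mode('overwrite').parquet('/tmp/rf-sample.parquet')

df_back = spark.read.parquet('/tmp/rf-sample.parquet')
df_back.printSchema()
```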

pyrasterframes/src/main/python/docs/vector-data.pymd

Lines changed: 5 additions & 5 deletions
@@ -1,6 +1,6 @@
 # Vector Data
 
-RasterFrames provides a variety of ways to work with spatial vector (points, lines, and polygons) data alongside raster data. There is a convenience DataSource for GeoJSON format, as well as the ability to convert from [GeoPandas][GeoPandas] to Spark. Representation of vector geometries in pyspark is through [Shapely][Shapely] which provides a great deal of flexibility. RasterFrames also provides access to Spark functions for working with geometries.
+RasterFrames provides a variety of ways to work with spatial vector data (points, lines, and polygons) alongside raster data. There is a convenience DataSource for the GeoJSON format, as well as the ability to convert from [GeoPandas][GeoPandas] to Spark. Representation of vector geometries in PySpark is through [Shapely][Shapely], providing a great deal of interoperability. RasterFrames also provides access to Spark functions for working with geometries.
 
 ## GeoJSON DataSource
 
@@ -19,11 +19,11 @@ df = spark.read.geojson(SparkFiles.get('admin1-us.geojson'))
 df.printSchema()
 ```
 
-The properties of each feature are available as columns of the DataFrame, along with the geometry.
+The properties of each discrete geometry are available as columns of the DataFrame, along with the geometry itself.
 
 ## GeoPandas and RasterFrames
 
-You can also convert a [GeoPandas][GeoPandas] GeoDataFrame to a Spark DataFrame, preserving the geometry column. This means that any vector format that can be read with [OGR][OGR] can be converted to a Spark DataFrame. In the example below, we expect the same schema as `df` defined above by the GeoJSON reader. Note that in a GeoPandas DataFrame there can be heterogeneous geometry types in the column, but this may fail Spark's schema inference.
+You can also convert a [GeoPandas][GeoPandas] GeoDataFrame to a Spark DataFrame, preserving the geometry column. This means that any vector format that can be read with [OGR][OGR] can be converted to a Spark DataFrame. In the example below, we expect the same schema as the DataFrame defined above by the GeoJSON reader. Note that in a GeoPandas DataFrame there can be heterogeneous geometry types in the column, which may fail Spark's schema inference.
 
 ```python, read_and_normalize
 import geopandas
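The `read_and_normalize` block is truncated by the diff; a conversion in the spirit it describes might look like this (a sketch; the file name and the normalization step are assumptions, shown to illustrate why heterogeneous geometry types need handling before schema inference):

```python
import geopandas
from shapely.geometry import MultiPolygon

gdf = geopandas.read_file('admin1-us.geojson')
# Normalize to a single geometry type so Spark's schema inference succeeds.
gdf.geometry = gdf.geometry.apply(
    lambda g: MultiPolygon([g]) if g.geom_type == 'Polygon' else g)

df2 = spark.createDataFrame(gdf)  # geometry column becomes a Shapely-compatible UDT
df2.printSchema()
```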
@@ -44,7 +44,7 @@ df2.printSchema()
 
 ## Shapely Geometry Support
 
-The `geometry` column will have a Spark user-defined type that is compatible with [Shapely][Shapely] when working on the Python side. This means that when the data is collected to the driver, it will be a Shapely geometry object.
+The `geometry` column will have a Spark user-defined type that is compatible with [Shapely][Shapely] when working with Python via PySpark. This means that when the data is collected to the driver, it will be a Shapely geometry object.
 
 ```python, show_geom
 the_first = df.first()
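The `show_geom` block is truncated here; the point it makes can be sketched as follows (attribute access on a collected Row plus ordinary Shapely methods; nothing beyond the first line is from this commit):

```python
the_first = df.first()
geom = the_first['geometry']

print(type(geom))         # a shapely.geometry class, e.g. MultiPolygon
print(geom.geom_type)     # Shapely's own type name
print(geom.centroid.wkt)  # the full client-side Shapely API is available
```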
@@ -89,7 +89,7 @@ df = df.withColumn('centroid', st_centroid(df.geometry))
 df.select('name', 'geometry', 'naive_centroid', 'centroid').show(4)
 ```
 
-The RasterFrames vector functions and GeoMesa functions also provide a variety of spatial relations that are useful in combination with the geometric properties of projected rasters. In this example, we use the @ref:[built-in Landsat catalog](raster-catalogs.md#using-built-in-experimental-catalogs) which provides an extent. We will convert the extent to a polygon and filter only those within approximately 500 km of a selected point.
+The RasterFrames vector functions and GeoMesa functions also provide a variety of spatial relations that are useful in combination with the geometric properties of projected rasters. In this example, we use the @ref:[built-in Landsat catalog](raster-catalogs.md#using-built-in-experimental-catalogs) which provides an extent. We will convert the extent to a polygon and filter to those within approximately 500 km of a selected point.
 
 ```python, evaluate=True
 from pyrasterframes.rasterfunctions import st_geometry, st_bufferPoint, st_intersects, st_point
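The filtering example is cut off by the diff context; the described operation might be sketched like this (the catalog DataFrame `cat`, its `bounds_wgs84` extent column, and the point coordinates are assumptions for illustration):

```python
from pyspark.sql.functions import lit
from pyrasterframes.rasterfunctions import st_geometry, st_bufferPoint, st_intersects, st_point

# Geodesic buffer of ~500 km (in meters) around a selected point (lon, lat).
pt = st_point(lit(-93.0), lit(45.0))
buffered = st_bufferPoint(pt, lit(500000.0))

# Convert each catalog extent to a polygon and keep the intersecting scenes.
cat_filtered = cat.filter(st_intersects(st_geometry(cat.bounds_wgs84), buffered))
```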
