
Commit a3038ab

Merge branch 'develop' into feature/travis-set-python3
* develop:
  Docs on raster write: samples, parquet
  Move numpy-pandas doc to pymd format; initial content
  Small docs tweaks lingering from PR 187
  raster reader can take a Pandas DataFrame or geopandas GeoDF
  change link on line 61 to go to nodata-handling.md
  fix tile stats and histogram functions; promote headings up 1 level; remove rf_cell_types convert function
  ref code chunks to running code
2 parents fca70aa + d2be387

10 files changed, +527 -314 lines

pyrasterframes/src/main/python/docs/index.md

Lines changed: 1 addition & 1 deletion
@@ -43,7 +43,7 @@ The source code can be found on GitHub at [locationtech/rasterframes](https://gi
 * [Raster Data I/O](raster-io.md)
 * [Vector Data](vector-data.md)
 * [Raster Processing](raster-processing.md)
-* [Pandas, NumPy & RasterFrames](pandas-numpy.md)
+* [NumPy, Pandas, & RasterFrames](numpy-pandas.md)
 * [UDF Reference](reference.md)
 * [Release Notes](release-notes.md)

pyrasterframes/src/main/python/docs/nodata-handling.pymd

Lines changed: 27 additions & 25 deletions
@@ -10,8 +10,6 @@ RasterFrames provides a variety of functions to inspect and manage NoData within
 
 To understand how NoData is handled in RasterFrames, we first need to understand the different underlying types of data called cell types. The cell types are GeoTrellis `CellType`s, so the [GeoTrellis documentation](https://geotrellis.readthedocs.io/en/latest/guide/core-concepts.html?#working-with-cell-values) is a valuable resource on how these are defined.
 
-Use the function `rf_cell_type` to find the cell type of a specific set of raster data.
-
 ```python setup, echo=False
 import pyrasterframes
 from pyrasterframes.rasterfunctions import *
@@ -21,23 +19,27 @@ from IPython.display import display
 spark = pyrasterframes.get_spark_session()
 ```
 
-```python ct_from_sen
-spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif') \
-    .select(rf_cell_type('proj_raster')).distinct().show()
+The `CellType` class from the `rf_types` submodule allows us to create a representation of any valid cell type. There are convenience methods to create instances for a variety of basic types.
+
+```python celltype_ctors
+from pyrasterframes.rf_types import CellType
+import inspect
+
+[c[0] for c in inspect.getmembers(CellType, inspect.ismethod)]
 ```
 
-The function `rf_cell_types` provides a convenient list of basic cell types. Note that this list is not exhaustive.
+We can also inspect the cell type of a given _tile_ or `proj_raster` column.
 
-```python
-rf_cell_types()
+```python ct_from_sen
+spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif') \
+    .select(rf_cell_type('proj_raster')).distinct().show()
 ```
 
 ### Understanding Cell Types and NoData
 
-Use the `CellType` class to learn more about a specific cell type.
+Use the methods on the `CellType` class to learn more about a specific cell type. Take for example the cell type of our sample data above.
 
 ```python
-from pyrasterframes.rf_types import CellType
 ct = CellType('uint16raw')
 ct, ct.is_floating_point(), ct.has_no_data()
 ```
@@ -85,30 +87,30 @@ Drawing on @ref:[local map algebra](local-algebra.md) techniques, we will create
 ```python def_mask
 from pyspark.sql.functions import lit
 
-unmasked = unmasked.withColumn('nodata', rf_local_equal('scl', lit(0)))
-unmasked = unmasked.withColumn('defect', rf_local_equal('scl', lit(1)))
-unmasked = unmasked.withColumn('cloud8', rf_local_equal('scl', lit(8)))
-unmasked = unmasked.withColumn('cloud9', rf_local_equal('scl', lit(9)))
-unmasked = unmasked.withColumn('cirrus', rf_local_equal('scl', lit(10)))
+mask_part = unmasked.withColumn('nodata', rf_local_equal('scl', lit(0))) \
+    .withColumn('defect', rf_local_equal('scl', lit(1))) \
+    .withColumn('cloud8', rf_local_equal('scl', lit(8))) \
+    .withColumn('cloud9', rf_local_equal('scl', lit(9))) \
+    .withColumn('cirrus', rf_local_equal('scl', lit(10)))
 
-unmasked = unmasked.withColumn('mask', rf_local_add('nodata', 'defect'))
-unmasked = unmasked.withColumn('mask', rf_local_add('mask', 'cloud8'))
-unmasked = unmasked.withColumn('mask', rf_local_add('mask', 'cloud9'))
-unmasked = unmasked.withColumn('mask', rf_local_add('mask', 'cirrus'))
+one_mask = mask_part.withColumn('mask', rf_local_add('nodata', 'defect')) \
+    .withColumn('mask', rf_local_add('mask', 'cloud8')) \
+    .withColumn('mask', rf_local_add('mask', 'cloud9')) \
+    .withColumn('mask', rf_local_add('mask', 'cirrus'))
 
-unmasked.select(rf_cell_type('mask')).distinct().show()
+one_mask.select(rf_cell_type('mask')).distinct().show()
 ```
 
 Now we will use the @ref:[`rf_mask_by_value`](reference.md#rf-mask-by-value) function to designate the cloudy and other unwanted pixels as NoData in the blue column. Because there is no NoData value already defined, we will choose one. Note that in this particular example the minimum value is greater than zero, so we can use 0 as the NoData value.
 
 ```python pick_nd
-unmasked.agg(rf_agg_stats('blue').min.alias('blue_min')).show()
+one_mask.agg(rf_agg_stats('blue').min.alias('blue_min')).show()
 ```
 
 We can now construct the cell type string for our blue band, designating 0 as NoData.
 
 ```python get_ct_string
-blue_ct = unmasked.select(rf_cell_type('blue')).distinct().first()[0][0]
+blue_ct = one_mask.select(rf_cell_type('blue')).distinct().first()[0][0]
 masked_blue_ct = CellType(blue_ct).with_no_data_value(0)
 masked_blue_ct.cell_type_name
 ```
@@ -117,9 +119,9 @@ Convert the cell type and apply the mask. Since the mask column is bit type, the
 
 ```python mask_blu
 with_nd = rf_convert_cell_type('blue', masked_blue_ct.cell_type_name)
-masked = unmasked.withColumn('blue_masked',
-                             rf_mask_by_value(with_nd, 'mask', lit(1))) \
-    .drop('nodata', 'defect', 'cloud8', 'cloud9', 'cirrus', 'blue')
+masked = one_mask.withColumn('blue_masked',
+                             rf_mask_by_value(with_nd, 'mask', lit(1))) \
+    .drop('nodata', 'defect', 'cloud8', 'cloud9', 'cirrus', 'blue')
 ```
 
 We can verify that the number of NoData cells in the resulting `blue_masked` column matches the total of the bit-type `mask` tile.
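
A hedged sketch of one such check, assuming the `masked` DataFrame above and the aggregate functions `rf_agg_no_data_cells` and `rf_tile_sum` from `pyrasterframes.rasterfunctions`:

```python
from pyspark.sql import functions as F

# The NoData count in the masked band should equal the number of
# 1-valued cells in the bit-type mask.
masked.agg(
    rf_agg_no_data_cells('blue_masked').alias('blue_masked_nodata'),
    F.sum(rf_tile_sum('mask')).alias('mask_total')
).show()
```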
pyrasterframes/src/main/python/docs/numpy-pandas.pymd

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
# NumPy and Pandas Interoperability

In the Python Spark API, the work of distributed computing over the DataFrame is done on many executors (the Spark term for workers) inside Java virtual machines (JVMs). Most calls to `pyspark` are passed to a Java process via the `py4j` library. The user can also ask for data inside the JVM to be brought over to the Python driver (the Spark term for the client application). When dealing with _tiles_, the driver will receive this data as a lightweight wrapper object around a NumPy `ndarray`. It is also possible to write lambda functions against NumPy arrays and evaluate them in the Spark DataFrame.
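
As a minimal sketch of that round trip, assuming `spark` is a RasterFrames-enabled session and using a sample COG from the RasterFrames test data:

```python
from pyrasterframes.rasterfunctions import rf_tile

df = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif')
# .first() is an action: one row crosses from the JVM to the Python driver.
row = df.select(rf_tile('proj_raster').alias('tile')).first()
type(row['tile'])  # pyrasterframes.rf_types.Tile, wrapping a NumPy ndarray
```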

## Performance Considerations

When working with large, distributed datasets in Spark, care is required when invoking _actions_ on the data. In general, _transformations_ are lazily evaluated in Spark, meaning the code runs quickly and doesn't move any data around. But _actions_ cause the evaluation to happen: all the lazily planned _transformations_ are computed, and data is processed and moved around. In general, if a [`pyspark` function](https://spark.apache.org/docs/2.3.2/api/python/pyspark.sql.html) returns a DataFrame, it is probably a _transformation_; otherwise, it is an _action_.
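
A rough sketch of the distinction, reusing the `df` from the sketch above:

```python
lazy = df.select(rf_tile('proj_raster').alias('tile'))  # transformation: returns a DataFrame, no data moves
lazy.count()  # action: triggers evaluation and returns a result to the driver
```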

When many _actions_ are invoked, a lot of data can flow from executors to the driver. In `pyspark`, the data then has to move from the driver JVM to the Python process running the driver. When that happens, any _tiles_ in the data will be converted to a Python [`Tile`](https://github.com/locationtech/rasterframes/blob/develop/pyrasterframes/src/main/python/pyrasterframes/rf_types.py) object. In practical work with Earth observation data, the _tiles_ are frequently 256 by 256 arrays, which may be 100 kB or more each. While each tile is individually small, a DataFrame can easily have dozens of such tile columns and millions of rows.

All of this reinforces an important principle for working with Spark: understand the cost of an _action_, and use @ref:[aggregates](aggregation.md), summaries, or samples to manage that cost.
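
For instance, a distributed aggregate returns a single small summary row to the driver rather than the tiles themselves (a sketch, assuming the `lazy` DataFrame above with its `tile` column):

```python
from pyrasterframes.rasterfunctions import rf_agg_stats

# Statistics are computed on the executors; only the summary crosses over.
lazy.agg(rf_agg_stats('tile').alias('stats')).show()
```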

## The `Tile` Class

In Python, _tiles_ are represented with the `rf_types.Tile` class. This is a NumPy `ndarray` with two dimensions, along with some additional metadata allowing correct conversion to the GeoTrellis @ref:[cell type](nodata-handling.md#cell-types).

```python tile_intro
from pyrasterframes.rf_types import Tile
import numpy as np

t = Tile(np.random.randn(4, 4))
print(str(t))
```

You can access the NumPy array with the `cells` member of `Tile`.

```python tile_cells
t.cells.shape, t.cells.nbytes
```

## DataFrame `toPandas`

As discussed in the @ref:[raster writing chapter](raster-write.md#dataframe-samples), a pretty display of a Pandas DataFrame containing _tiles_ is available by importing the `rf_ipython` submodule. In addition, as discussed in the @ref:[vector data chapter](vector-data.md), any geometry type in the Spark DataFrame will be converted into a Shapely geometry. Taken together, we can easily get the spatial information and raster data as a NumPy array, all within a Pandas DataFrame.

```python spark_session, echo=False
import pyrasterframes
from pyrasterframes.rasterfunctions import *
from IPython.display import display
spark = pyrasterframes.get_spark_session()
```

```python toPandas
import pyrasterframes.rf_ipython
from pyspark.sql.functions import lit, col

cat = spark.read.format('aws-pds-modis-catalog').load() \
    .filter(
        (col('granule_id') == 'h11v04') &
        (col('acquisition_date') > lit('2018-02-19')) &
        (col('acquisition_date') < lit('2018-02-22'))
    )

spark_df = spark.read.raster(catalog=cat, catalog_col_names=['B01']) \
    .select(
        'acquisition_date',
        'granule_id',
        rf_tile('B01').alias('tile'),
        rf_geometry('B01').alias('tile_geom')
    )

pandas_df = spark_df.limit(10).toPandas()
pandas_df.iloc[0].apply(lambda v: type(v))
```

## User Defined Functions

As we demonstrated with @ref:[vector data](vector-data.md#shapely-geometry-support), we can also make use of the `Tile` type to create [user-defined functions (UDFs)](https://spark.apache.org/docs/2.3.2/api/python/pyspark.sql.html#pyspark.sql.functions.udf) that can take a _tile_ as input, return a _tile_ as output, or both. Here is a trivial and **inefficient** example of doing both. A serious performance implication of user-defined functions in Python is that all the executors must move the Java objects to Python, evaluate the function, and then move the Python objects back to Java. Use the many @ref:[built-in functions](reference.md) wherever possible, and ask the [community](https://gitter.im/s22s/raster-frames) if you have an idea for a function that should be included.

We will demonstrate creating a UDF that is logically equivalent to a built-in function. We'll quickly show that the resulting _tiles_ are approximately equivalent. The reason they are not exactly the same is that one is computed in Python and the other is computed in Java.
```python udf
from pyrasterframes.rf_types import TileUDT
from pyspark.sql.functions import udf

@udf(TileUDT())
def my_udf(t):
    # Import inside the function so NumPy is available on the executors.
    import numpy as np
    return Tile(np.log1p(t.cells))

udf_df = spark_df.limit(1).select(
    my_udf('tile').alias('udf_result'),
    rf_log1p('tile').alias('built_in_result')
).toPandas()

row = udf_df.iloc[0]
diff = row['udf_result'] - row['built_in_result']
print(type(diff))
np.abs(diff.cells).max()
```

We can also inspect an image of the difference between the two _tiles_; it is just random noise. Both tiles have the same structure of NoData, shown as the white areas.

```python udf_diff_noise_tile
display(diff)
```

## Creating a Spark DataFrame

You can also create a Spark DataFrame with a column full of `Tile` objects or Shapely geometry objects.

The example below will create a Pandas DataFrame with ten rows of noise tiles and random Points. We will then create a Spark DataFrame from it.
```python create_spark_df
import pandas as pd
from shapely.geometry import Point

pandas_df = pd.DataFrame([
    {
        'tile': Tile(np.random.randn(100, 100)),
        'geom': Point(-90 + 90 * np.random.random((2, 1)))
    } for _ in range(10)
])

spark_df = spark.createDataFrame(pandas_df)

spark_df.printSchema()
spark_df.count()
```

pyrasterframes/src/main/python/docs/pandas-numpy.md

Lines changed: 0 additions & 10 deletions
This file was deleted.

pyrasterframes/src/main/python/docs/raster-io.md

Lines changed: 3 additions & 4 deletions
@@ -2,19 +2,18 @@
 
 The standard mechanism by which any data is brought in and out of a Spark DataFrame is the [Spark SQL DataSource][DS]. RasterFrames provides specialized DataSources for geospatial raster data and maintains compatibility with existing general purpose DataSources, such as Parquet.
 
-Three types of DataSources will be introduced:
-
 * @ref:[Catalog Readers](raster-catalogs.md)
     - `aws-pds-l8-catalog`: built-in catalog over [Landsat on AWS][Landsat]
     - `aws-pds-modis-catalog`: built-in catalog over [MODIS on AWS][MODIS]
     - `geotrellis-catalog`: for enumerating [GeoTrellis layers][GTLayer]
 * @ref:[Raster Readers](raster-read.md)
-    - `raster`: the standard reader for most raster data
+    - `raster`: the standard reader for most raster data, including single raster files or catalogs
     - `geotiff`: a simplified reader for reading a single GeoTIFF file
     - `geotrellis`: for reading a [GeoTrellis layer][GTLayer]
 * @ref:[Raster Writers](raster-write.md)
-    - `geotrellis`: for creating a [GeoTrellis layer][GTLayer]
+    - @ref:[Tile](raster-write.md#tile-samples) and @ref:[DataFrame](raster-write.md#dataframe-samples) samples
     - `geotiff`: beta writer to GeoTiff file
+    - `geotrellis`: creating a [GeoTrellis layer][GTLayer]
     - [`parquet`][Parquet]: general purpose writer
 
 There is also support for @ref:[vector data](vector-data.md) for masking and data labeling.

pyrasterframes/src/main/python/docs/raster-read.pymd

Lines changed: 3 additions & 1 deletion
@@ -12,13 +12,15 @@ RasterFrames registers a DataSource named `raster` that enables reading of GeoTI
 
 ## Single Raster
 
-The simplest form is reading a single raster from a single URI:
+The simplest form is reading a single raster from a single URI.
 
 ```python read_one_uri
 rf = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif')
 rf.printSchema()
 ```
 
+The file at the address above is a valid [Cloud Optimized GeoTIFF (COG)](https://www.cogeo.org/), which RasterFrames fully supports. RasterFrames takes advantage of the optimizations in the COG format to read more efficiently than it can from vanilla GeoTIFFs.
+
 Let's unpack the `proj_raster` column and look at the contents in more detail. It contains a [_CRS_][CRS], a spatial _extent_ measured in that CRS, and a two-dimensional array of numeric values called a _tile_.
 
 ```python unpack_schema