pyrasterframes/src/main/python/docs/aggregation.pymd
+11 -11
@@ -1,6 +1,6 @@
# Aggregation

-```python, echo=False
+```python, setup, echo=False
from docs import *
from pyrasterframes.utils import create_rf_spark_session
from pyrasterframes.rasterfunctions import *
@@ -16,7 +16,7 @@ There are 3 types of aggregate functions: _tile_ aggregate, DataFrame aggregate,
We can illustrate these differences in computing an aggregate mean. First, we create a sample DataFrame of 2 _tiles_ where the first _tile_ is composed of 25 values of 1.0 and the second _tile_ is composed of 25 values of 3.0.
In this code block, we are using the @ref:[`rf_tile_mean`](reference.md#rf-tile-mean) function to compute the _tile_ aggregate mean of cells in each row of column `tile`. The mean of each _tile_ is computed separately, so the first mean is 1.0 and the second mean is 3.0. Notice that the number of rows in the DataFrame is the same before and after the aggregation.
In this code block, we are using the @ref:[`rf_agg_mean`](reference.md#rf-agg-mean) function to compute the DataFrame aggregate, which averages 25 values of 1.0 and 25 values of 3.0, across the fifty cells in two rows. Note that only a single row is returned since the average is computed over the full DataFrame.
In this code block, we are using the @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean) function to compute the element-wise local aggregate mean across the two rows. In this example it is computing the mean of one value of 1.0 and one value of 3.0 to arrive at the element-wise mean, but doing so twenty-five times, once for each position in the _tile_.
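The three code blocks these paragraphs refer to are collapsed out of this diff. As a minimal sketch (not the original blocks), assuming an active RasterFrames Spark session, an illustrative `means_rf` name, and that `rf_local_multiply` accepts a scalar operand for building the _tile_ of 3.0 values:

```python
from pyrasterframes.rasterfunctions import rf_tile_mean, rf_agg_mean, rf_agg_local_mean

# Two 5x5 tiles: one of all 1.0 values and one of all 3.0 values
means_rf = spark.sql("""
SELECT 1 AS id, rf_make_ones_tile(5, 5, 'float32') AS tile
UNION
SELECT 2 AS id, rf_local_multiply(rf_make_ones_tile(5, 5, 'float32'), 3) AS tile
""")

# Tile aggregate: one mean per row, 1.0 and 3.0
means_rf.select(rf_tile_mean('tile')).show()

# DataFrame aggregate: a single mean of 2.0 over all fifty cells
means_rf.agg(rf_agg_mean('tile')).show()

# Element-wise local aggregate: a 5x5 tile whose every cell is 2.0
means_rf.agg(rf_agg_local_mean('tile')).show()
```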
To compute an element-wise local aggregate, _tiles_ need to have the same dimensions, as in the example below where both _tiles_ have 5 rows and 5 columns. If we tried to compute an element-wise local aggregate over the DataFrame without equal _tile_ dimensions, we would get a runtime error.
We can also count the total number of data and NoData cells over all the _tiles_ in a DataFrame using @ref:[`rf_agg_data_cells`](reference.md#rf-agg-data-cells) and @ref:[`rf_agg_no_data_cells`](reference.md#rf-agg-no-data-cells). There are 3,842,290 data cells and 1,941,734 NoData cells in this DataFrame. See section on @ref:["NoData" handling](nodata-handling.md) for additional discussion on handling missing data.
The @ref:[`rf_agg_stats`](reference.md#rf-agg-stats) function aggregates over all of the _tiles_ in a DataFrame and returns a statistical summary of all cell values as shown below.
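The code blocks referenced by the two preceding paragraphs are likewise not part of this hunk. A sketch of the calls, assuming a DataFrame `rf` with a _tile_ column named `tile` (such as the Landsat DataFrame used earlier in the page):

```python
from pyrasterframes.rasterfunctions import rf_agg_data_cells, rf_agg_no_data_cells, rf_agg_stats

# Total counts of data and NoData cells over every tile in the DataFrame
rf.agg(rf_agg_data_cells('tile'), rf_agg_no_data_cells('tile')).show()

# Statistical summary (min, max, mean, variance, cell counts) of all cell values
rf.agg(rf_agg_stats('tile')).show()
```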
The @ref:[`rf_agg_local_stats`](reference.md#rf-agg-local-stats) function computes the element-wise local aggregate statistical summary as shown below. The DataFrame used in the previous two code blocks has unequal _tile_ dimensions, so a different DataFrame is used in this code block to avoid a runtime error.
-```python
+```python, agg_local_stats
rf = spark.sql("""
SELECT 1 as id, rf_make_ones_tile(5, 5, 'float32') as tile
UNION
@@ -103,7 +103,7 @@ for r in agg_local_stats:
The @ref:[`rf_tile_histogram`](reference.md#rf-tile-histogram) function computes a count of cell values within each row of _tile_ and outputs a `bins` array with the schema below. In the graph below, we have plotted `value` on the x-axis and `count` on the y-axis to create the histogram. There are 100 rows of _tile_ in this DataFrame, but this histogram is just computed for the _tile_ in the first row.
The @ref:[`rf_agg_approx_histogram`](reference.md#rf-agg-approx-histogram) function computes a count of cell values across all of the rows of _tile_ in a DataFrame or group. In the example below, the range of the y-axis is significantly wider than the range of the y-axis on the previous histogram since this histogram was computed for all cell values in the DataFrame.
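Neither histogram block is shown in this hunk; a sketch of the two calls, again assuming a DataFrame `rf` with a `tile` column:

```python
from pyrasterframes.rasterfunctions import rf_tile_histogram, rf_agg_approx_histogram

# Per-row histogram; the returned struct contains the `bins` array described above
rf.select(rf_tile_histogram('tile')).first()

# Approximate histogram over every cell value in the DataFrame
rf.agg(rf_agg_approx_histogram('tile')).first()
```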
pyrasterframes/src/main/python/docs/unsupervised-learning.pymd
+13 -13
@@ -4,7 +4,7 @@ In this example, we will demonstrate how to fit and score an unsupervised learni
## Imports and Data Preparation

-```python, echo=False
+```python, setup, echo=False
from IPython.core.display import display
from docs import resource_dir_uri
from pyrasterframes.utils import create_rf_spark_session
@@ -18,7 +18,7 @@ import pandas as pd
We import various Spark components that we need to construct our `Pipeline`.

-```python, echo=True
+```python, imports, echo=True
from pyrasterframes import TileExploder
from pyrasterframes.rasterfunctions import rf_assemble_tile, rf_crs, rf_extent, rf_tile, rf_dimensions
@@ -31,7 +31,7 @@ from pyspark.ml import Pipeline
The first step is to create a Spark DataFrame of our imagery data. To achieve that we will create a catalog DataFrame using the pattern from [the I/O page](raster-io.html#Single-Scene--Multiple-Bands). In the catalog, each row represents a distinct area and time, and each column is the URI to a band's image product. The function `resource_dir_uri` gives a local file system path to the sample Landsat data. The resulting Spark DataFrame may have many rows per URI, with a column corresponding to each band.
-```python, term=True
+```python, catalog, term=True
filenamePattern = "L8-B{}-Elkton-VA.tiff"
catalog_df = pd.DataFrame([
    {'b' + str(b): os.path.join(resource_dir_uri(), filenamePattern.format(b)) for b in range(1, 8)}
@@ -55,54 +55,54 @@ df.printSchema()
SparkML requires that each observation be in its own row, and features for each observation be packed into a single `Vector`. For this unsupervised learning problem, we will treat each _pixel_ as an observation and each band as a feature. The first step is to "explode" the _tiles_ into a single row per pixel. In RasterFrames, generally a pixel is called a @ref:[`cell`](concepts.md#cell).
-```python
+```python, exploder
exploder = TileExploder()
```
To "vectorize" the the band columns, we use the SparkML `VectorAssembler`. Each of the seven bands is a different feature.
-```python
+```python, assembler
assembler = VectorAssembler() \
    .setInputCols(list(catalog_df.columns)) \
    .setOutputCol("features")
```
For this problem, we will use the K-means clustering algorithm and configure our model to have 5 clusters.
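The stage-construction code falls between hunks and is not shown. A plausible sketch, assuming `KMeans` from `pyspark.ml.clustering` and the `exploder` and `assembler` stages defined above (consistent with the three-stage pipeline implied by `model.stages[2]` below):

```python
from pyspark.ml.clustering import KMeans

# Cluster the per-pixel feature vectors into 5 groups
kmeans = KMeans().setK(5).setFeaturesCol('features')

# Explode tiles to pixels, assemble band features, then fit K-means
pipeline = Pipeline().setStages([exploder, assembler, kmeans])
```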
Fitting the _pipeline_ actually executes exploding the _tiles_, assembling the feature _vectors_, and fitting the K-means clustering model.
-```python
+```python, fit
model = pipeline.fit(df)
```
We can use the `transform` function to score the training data in the fitted _pipeline_ model. This will add a column called `prediction` with the closest cluster identifier.
-```python
+```python, transform
clustered = model.transform(df)
clustered.show(8)
```
If we want to inspect the model statistics, the SparkML API requires us to go through this unfortunate contortion:
-```python
+```python, cluster_stats
cluster_stage = model.stages[2]
```
We can then compute the sum of squared distances of points to their nearest center, which is elemental to most cluster quality metrics.
-```python
+```python, distance
metric = cluster_stage.computeCost(clustered)
print("Within set sum of squared errors: %s" % metric)
```
@@ -111,7 +111,7 @@ print("Within set sum of squared errors: %s" % metric)
We can recreate the tiled data structure using the metadata added by the `TileExploder` pipeline stage.
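The re-assembly code itself is outside this hunk. A rough sketch under stated assumptions: `TileExploder` adds `column_index` and `row_index` columns, `crs` and `extent` columns were added to `df` before fitting, and the band column `b1` is available for recovering dimensions via `rf_dimensions`:

```python
# Recover the original tile dimensions from one of the source band columns
tile_dims = df.select(rf_dimensions('b1').alias('dims')).first()['dims']

# Group the exploded pixels back into one tile of cluster ids per scene extent
retiled = clustered.groupBy('crs', 'extent') \
    .agg(rf_assemble_tile('column_index', 'row_index', 'prediction',
                          tile_dims['cols'], tile_dims['rows']))
```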
@@ -24,7 +24,7 @@ The properties of each feature are available as columns of the DataFrame, along
You can also convert a [GeoPandas][GeoPandas] GeoDataFrame to a Spark DataFrame, preserving the geometry column. This means that any vector format that can be read with [OGR][OGR] can be converted to a Spark DataFrame. In the example below, we expect the same schema as the `df` defined above by the GeoJSON reader. Note that a GeoPandas GeoDataFrame can contain heterogeneous geometry types in its geometry column, but this may cause Spark's schema inference to fail.
-```python
+```python, read_and_normalize
import geopandas
from shapely.geometry import MultiPolygon
@@ -45,20 +45,20 @@ df2.printSchema()
The `geometry` column will have a Spark user-defined type that is compatible with [Shapely][Shapely] when working on the Python side. This means that when the data is collected to the driver, it will be a Shapely geometry object.
-```python
+```python, show_geom
the_first = df.first()
print(type(the_first['geometry']))
```
Since it is a geometry, we can do things like this:
-```python
+```python, show_wkt
the_first['geometry'].wkt
```
You can also write user-defined functions that take geometries as input, output, or both, via user-defined types in the [geomesa_pyspark.types](https://github.com/locationtech/rasterframes/blob/develop/pyrasterframes/src/main/python/geomesa_pyspark/types.py) module. Here is a simple example of a user-defined function that uses both a geometry input and output to compute the centroid of a geometry.
-```python
+```python, add_centroid
from pyspark.sql.functions import udf
from geomesa_pyspark.types import PointUDT
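# The remainder of this block is not shown in the hunk. A sketch of a
# geometry-in/geometry-out UDF of the kind described above (function and
# column names here are illustrative, not necessarily those in the document):
@udf(PointUDT())
def py_centroid(g):
    return g.centroid

df = df.withColumn('py_centroid', py_centroid(df['geometry']))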
@@ -72,7 +72,7 @@ df.printSchema()
We can take a look at a sample of the data. Notice the geometry columns print as well-known text (WKT).
-```python
+```python, show_centroid
df.show(4)
```
@@ -82,7 +82,7 @@ df.show(4)
As documented in the @ref:[function reference](reference.md), various user-defined functions implemented by GeoMesa are also available for use. The example below uses a GeoMesa user-defined function to compute the centroid of a geometry. It is logically equivalent to the example above, but more efficient.
-```python
+```python, native_centroid
from pyrasterframes.rasterfunctions import st_centroid
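# The rest of this block is not shown in the hunk. Presumably st_centroid is
# applied directly to the geometry column, e.g. (column name assumed):
df = df.withColumn('centroid', st_centroid(df['geometry']))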