
Commit f6fdcd4

Misc docs tweaking.
1 parent 400b832 commit f6fdcd4

7 files changed: +42 additions, −39 deletions

build.sbt

Lines changed: 5 additions & 2 deletions
@@ -141,9 +141,12 @@ lazy val docs = project
   ),
   paradoxNavigationExpandDepth := Some(3),
   paradoxTheme := Some(builtinParadoxTheme("generic")),
-  makeSite := makeSite.dependsOn(Compile / unidoc).dependsOn(Compile / paradox).value,
+  makeSite := makeSite
+    .dependsOn(Compile / unidoc)
+    .dependsOn((Compile / paradox)
+      .dependsOn(pyrasterframes / doc)
+    ).value,
   Compile / paradox / sourceDirectories += (pyrasterframes / Python / doc / target).value,
-  Compile / paradox := (Compile / paradox).dependsOn(pyrasterframes / doc).value
 )
 .settings(
   addMappingsToSiteDir(ScalaUnidoc / packageDoc / mappings, ScalaUnidoc / siteSubdirName)

core/src/main/scala/org/locationtech/rasterframes/model/FixedRasterExtent.scala

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ package org.locationtech.rasterframes.model
 import geotrellis.raster._
 import geotrellis.vector._

-import scala.math.{ceil, max, min}
+import scala.math.ceil

 /**
  * This class is a copy of the GeoTrellis 2.x `RasterExtent`,

pyrasterframes/src/main/python/docs/aggregation.pymd

Lines changed: 11 additions & 11 deletions
@@ -1,6 +1,6 @@
 # Aggregation

-```python, echo=False
+```python, setup, echo=False
 from docs import *
 from pyrasterframes.utils import create_rf_spark_session
 from pyrasterframes.rasterfunctions import *
@@ -16,7 +16,7 @@ There are 3 types of aggregate functions: _tile_ aggregate, DataFrame aggregate,

 We can illustrate these differences in computing an aggregate mean. First, we create a sample DataFrame of 2 _tiles_ where the first _tile_ is composed of 25 values of 1.0 and the second _tile_ is composed of 25 values of 3.0.

-```python
+```python, sql_dataframe
 import pyspark.sql.functions as F

 rf = spark.sql("""
@@ -30,29 +30,29 @@ rf.select("id", rf_render_matrix("tile")).show(10, False)

 In this code block, we are using the @ref:[`rf_tile_mean`](reference.md#rf-tile-mean) function to compute the _tile_ aggregate mean of cells in each row of column `tile`. The mean of each _tile_ is computed separately, so the first mean is 1.0 and the second mean is 3.0. Notice that the number of rows in the DataFrame is the same before and after the aggregation.

-```python
+```python, tile_mean
 rf.select(F.col('id'), rf_tile_mean(F.col('tile'))).show(10, False)
 ```

 In this code block, we are using the @ref:[`rf_agg_mean`](reference.md#rf-agg-mean) function to compute the DataFrame aggregate, which averages 25 values of 1.0 and 25 values of 3.0, across the fifty cells in two rows. Note that only a single row is returned since the average is computed over the full DataFrame.

-```python
+```python, agg_mean
 rf.agg(rf_agg_mean(F.col('tile'))).show(10, False)
 ```

 In this code block, we are using the @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean) function to compute the element-wise local aggregate mean across the two rows. In this example it is computing the mean of one value of 1.0 and one value of 3.0 to arrive at the element-wise mean, but doing so twenty-five times, one for each position in the _tile_.

 To compute an element-wise local aggregate, _tiles_ need have the same dimensions as in the example below where both _tiles_ have 5 rows and 5 columns. If we tried to compute an element-wise local aggregate over the DataFrame without equal _tile_ dimensions, we would get a runtime error.

-```python
+```python, local_mean
 rf.agg(rf_agg_local_mean(F.col('tile')).alias("local_mean")).select(rf_render_matrix("local_mean")).show(10, False)
 ```

 ## Cell Counts Example

 We can also count the total number of data and NoData cells over all the _tiles_ in a DataFrame using @ref:[`rf_agg_data_cells`](reference.md#rf-agg-data-cells) and @ref:[`rf_agg_no_data_cells`](reference.md#rf-agg-no-data-cells). There are 3,842,290 data cells and 1,941,734 NoData cells in this DataFrame. See section on @ref:["NoData" handling](nodata-handling.md) for additional discussion on handling missing data.

-```python
+```python, cell_counts
 rf = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/MCD43A4.006/11/05/2018233/MCD43A4.A2018233.h11v05.006.2018242035530_B02.TIF')
 stats = rf.agg(rf_agg_data_cells('proj_raster'), rf_agg_no_data_cells('proj_raster'))

@@ -65,7 +65,7 @@ The statistical summary functions return a summary of cell values: number of dat

 The @ref:[`rf_tile_stats`](reference.md#rf-tile-stats) function computes summary statistics separately for each row in a _tile_ column as shown below.

-```python
+```python, tile_stats
 rf = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif')
 stats = rf.select(rf_tile_stats('proj_raster').alias('stats'))

@@ -75,15 +75,15 @@ stats.select('stats.min', 'stats.max', 'stats.mean', 'stats.variance').show(10,

 The @ref:[`rf_agg_stats`](reference.md#rf-agg-stats) function aggregates over all of the _tiles_ in a DataFrame and returns a statistical summary of all cell values as shown below.

-```python
+```python, agg_stats
 rf.agg(rf_agg_stats('proj_raster').alias('stats')) \
     .select('stats.min', 'stats.max', 'stats.mean', 'stats.variance') \
     .show(10, False)
 ```

 The @ref:[`rf_agg_local_stats`](reference.md#rf-agg-local-stats) function computes the element-wise local aggregate statistical summary as shown below. The DataFrame used in the previous two code blocks, has unequal _tile_ dimensions, so a different DataFrame is used in this code block to avoid a runtime error.

-```python
+```python, agg_local_stats
 rf = spark.sql("""
 SELECT 1 as id, rf_make_ones_tile(5, 5, 'float32') as tile
 UNION
@@ -103,7 +103,7 @@ for r in agg_local_stats:

 The @ref:[`rf_tile_histogram`](reference.md#rf-tile-histogram) function computes a count of cell values within each row of _tile_ and outputs a `bins` array with the schema below. In the graph below, we have plotted `value` on the x-axis and `count` on the y-axis to create the histogram. There are 100 rows of _tile_ in this DataFrame, but this histogram is just computed for the _tile_ in the first row.

-```python
+```python, tile_histogram
 import matplotlib.pyplot as plt

 rf = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/MCD43A4.006/11/05/2018233/MCD43A4.A2018233.h11v05.006.2018242035530_B02.TIF')
@@ -121,7 +121,7 @@ plt.show()

 The @ref:[`rf_agg_approx_histogram`](reference.md#rf-agg-approx-histogram) function computes a count of cell values across all of the rows of _tile_ in a DataFrame or group. In the example below, the range of the y-axis is significantly wider than the range of the y-axis on the previous histogram since this histogram was computed for all cell values in the DataFrame.

-```python
+```python, agg_histogram
 bins_list = rf.agg(
     rf_agg_approx_histogram('proj_raster')['bins'].alias('bins')
 ).collect()
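
A note on the chunk headers edited throughout this file: in these `.pymd` sources the code-fence info string carries Pweave-style chunk metadata, where the first token after `python` is a chunk name and the remaining tokens are evaluation options, presumably so build output and failures can be traced to a specific block. The sketch below is a hypothetical chunk (not part of this commit) illustrating the convention the renamed fences follow; it assumes `rf` and the RasterFrames functions imported in the document's setup chunk.

```python, sample_counts, echo=True
# `sample_counts` is the chunk name; `echo=True` keeps the source visible
# in the rendered page. Counts data vs. NoData cells across the DataFrame.
rf.agg(rf_agg_data_cells('proj_raster'), rf_agg_no_data_cells('proj_raster')).show()
```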

pyrasterframes/src/main/python/docs/index.md

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ The source code can be found on GitHub at [locationtech/rasterframes](https://gi

 ## Detailed Contents

-@@ toc { depth=2 }
+@@ toc { depth=3 }

 @@@

pyrasterframes/src/main/python/docs/languages.pymd

Lines changed: 3 additions & 3 deletions
@@ -69,7 +69,7 @@ def sql(stmt):

 ### Step 1: Load the catalog

-```python, step_1_sql
+```python, step_1_sql, results=hidden
 sql("CREATE OR REPLACE TEMPORARY VIEW modis USING `aws-pds-modis-catalog`")
 ```

@@ -87,7 +87,7 @@ sql('DESCRIBE red_nir_monthly_2017').show()

 ### Step 3: Read tiles

-```python, step_3_sql
+```python, step_3_sql, results=hidden
 sql("""
 CREATE OR REPLACE TEMPORARY VIEW red_nir_tiles_monthly_2017
 USING raster
@@ -117,7 +117,7 @@ SELECT month, ndvi_stats.* FROM (

 The latest Scala API documentation is available here:

-* @ref:[Scala API Documentation](http://rasterframes.io/latest/api/index.html)
+* [Scala API Documentation](https://rasterframes.io/latest/api/index.html)


 ### Step 1: Load the catalog
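
The `results=hidden` option added to the two SQL chunks keeps the statements evaluated at build time while omitting their output from the rendered page, which suits `CREATE VIEW`-style statements run only for their side effects. A hypothetical chunk illustrating the option (the name and view are placeholders, not from this file; it assumes the `sql()` helper defined earlier in `languages.pymd`):

```python, create_view_example, results=hidden
# Runs when the docs are built, but the (empty) result set is not shown on the page.
sql("CREATE OR REPLACE TEMPORARY VIEW example_view USING `aws-pds-modis-catalog`")
```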

pyrasterframes/src/main/python/docs/unsupervised-learning.pymd

Lines changed: 13 additions & 13 deletions
@@ -4,7 +4,7 @@ In this example, we will demonstrate how to fit and score an unsupervised learni

 ## Imports and Data Preparation

-```python, echo=False
+```python, setup, echo=False
 from IPython.core.display import display
 from docs import resource_dir_uri
 from pyrasterframes.utils import create_rf_spark_session
@@ -18,7 +18,7 @@ import pandas as pd

 We import various Spark components that we need to construct our `Pipeline`.

-```python, echo=True
+```python, imports, echo=True
 from pyrasterframes import TileExploder
 from pyrasterframes.rasterfunctions import rf_assemble_tile, rf_crs, rf_extent, rf_tile, rf_dimensions

@@ -31,7 +31,7 @@ from pyspark.ml import Pipeline

 The first step is to create a Spark DataFrame of our imagery data. To achieve that we will create a catalog DataFrame using the pattern from [the I/O page](raster-io.html#Single-Scene--Multiple-Bands). In the catalog, each row represents a distinct area and time, and each column is the URI to a band's image product. The function `resource_dir_uri` gives a local file system path to the sample Landsat data. The resulting Spark DataFrame may have many rows per URI, with a column corresponding to each band.


-```python, term=True
+```python, catalog, term=True
 filenamePattern = "L8-B{}-Elkton-VA.tiff"
 catalog_df = pd.DataFrame([
     {'b' + str(b): os.path.join(resource_dir_uri(), filenamePattern.format(b)) for b in range(1, 8)}
@@ -55,54 +55,54 @@ df.printSchema()

 SparkML requires that each observation be in its own row, and features for each observation be packed into a single `Vector`. For this unsupervised learning problem, we will treat each _pixel_ as an observation and each band as a feature. The first step is to "explode" the _tiles_ into a single row per pixel. In RasterFrames, generally a pixel is called a @ref:[`cell`](concepts.md#cell).

-```python
+```python, exploder
 exploder = TileExploder()
 ```

 To "vectorize" the the band columns, we use the SparkML `VectorAssembler`. Each of the seven bands is a different feature.

-```python
+```python, assembler
 assembler = VectorAssembler() \
     .setInputCols(list(catalog_df.columns)) \
     .setOutputCol("features")
 ```

 For this problem, we will use the K-means clustering algorithm and configure our model to have 5 clusters.

-```python
+```python, kmeans
 kmeans = KMeans().setK(5).setFeaturesCol('features')
 ```

 We can combine the above stages into a single _pipeline_.

-```python
+```python, pipeline
 pipeline = Pipeline().setStages([exploder, assembler, kmeans])
 ```

 ## Fit the Model and Score

 Fitting the _pipeline_ actually executes exploding the _tiles_, assembling the features _vectors_, and fitting the K-means clustering model.

-```python
+```python, fit
 model = pipeline.fit(df)
 ```

 We can use the `transform` function to score the training data in the fitted _pipeline_ model. This will add a column called `prediction` with the closest cluster identifier.

-```python
+```python, transform
 clustered = model.transform(df)
 clustered.show(8)
 ```

 If we want to inspect the model statistics, the SparkML API requires us to go through this unfortunate contortion:

-```python
+```python, cluster_stats
 cluster_stage = model.stages[2]
 ```

 We can then compute the sum of squared distances of points to their nearest center, which is elemental to most cluster quality metrics.

-```python
+```python, distance
 metric = cluster_stage.computeCost(clustered)
 print("Within set sum of squared errors: %s" % metric)
 ```
@@ -111,7 +111,7 @@ print("Within set sum of squared errors: %s" % metric)

 We can recreate the tiled data structure using the metadata added by the `TileExploder` pipeline stage.

-```python
+```python, assemble
 from pyrasterframes.rf_types import CellType

 tile_dims = df.select(rf_dimensions(df.b1).alias('dims')).first()['dims']
@@ -127,6 +127,6 @@ retiled.show()

 The resulting output is shown below.

-```python
+```python, viz
 display(retiled.select('prediction').first()['prediction'])
 ```
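
For orientation, the newly named chunks in this file compose a single SparkML pipeline. The condensed sketch below is pieced together from the `exploder`, `assembler`, `kmeans`, `pipeline`, `fit`, and `transform` chunks in the hunks above; it assumes the `df` and `catalog_df` DataFrames built earlier in the document, and the `VectorAssembler`/`KMeans` imports are the standard pyspark.ml ones rather than lines visible in these hunks.

```python
from pyrasterframes import TileExploder
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline

exploder = TileExploder()                       # one row per cell, keeping tile metadata
assembler = VectorAssembler() \
    .setInputCols(list(catalog_df.columns)) \
    .setOutputCol("features")                   # pack the seven band columns into a feature vector
kmeans = KMeans().setK(5).setFeaturesCol('features')

pipeline = Pipeline().setStages([exploder, assembler, kmeans])
model = pipeline.fit(df)
clustered = model.transform(df)                 # adds a `prediction` column with the cluster id
```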

pyrasterframes/src/main/python/docs/vector-data.pymd

Lines changed: 8 additions & 8 deletions
@@ -4,13 +4,13 @@ RasterFrames provides a variety of ways to work with spatial vector (points, lin

 ## GeoJSON DataSource

-```python, echo=False
+```python, setup, echo=False
 import pyrasterframes
 from pyrasterframes.utils import create_rf_spark_session
 spark = create_rf_spark_session()
 ```

-```python
+```python, read_geojson
 from pyspark import SparkFiles
 spark.sparkContext.addFile('https://raw.githubusercontent.com/datasets/geo-admin1-us/master/data/admin1-us.geojson')

@@ -24,7 +24,7 @@ The properties of each feature are available as columns of the DataFrame, along

 You can also convert a [GeoPandas][GeoPandas] GeoDataFrame to a Spark DataFrame, preserving the geometry column. This means that any vector format that can be read with [OGR][OGR] can be converted to a Spark DataFrame. In the example below, we expect the same schema as `df` defined above by the GeoJSON reader. Note that in a GeoPandas DataFrame there can be heterogeneous geometry types in the column, but this may fail Spark's schema inference.

-```python
+```python, read_and_normalize
 import geopandas
 from shapely.geometry import MultiPolygon

@@ -45,20 +45,20 @@ df2.printSchema()

 The `geometry` column will have a Spark user-defined type that is compatible with [Shapely][Shapely] when working on the Python side. This means that when the data is collected to the driver, it will be a Shapely geometry object.

-```python
+```python, show_geom
 the_first = df.first()
 print(type(the_first['geometry']))
 ```

 Since it is a geometry we can do things like this:

-```python
+```python, show_wkt
 the_first['geometry'].wkt
 ```

 You can also write user-defined functions that take geometries as input, output, or both, via user defined types in the [geomesa_pyspark.types](https://github.com/locationtech/rasterframes/blob/develop/pyrasterframes/src/main/python/geomesa_pyspark/types.py) module. Here is a simple example of a user-defined function that uses both a geometry input and output to compute the centroid of a geometry.

-```python
+```python, add_centroid
 from pyspark.sql.functions import udf
 from geomesa_pyspark.types import PointUDT

@@ -72,7 +72,7 @@ df.printSchema()

 We can take a look at a sample of the data. Notice the geometry columns print as well known text (wkt).

-```python
+```python, show_centroid
 df.show(4)
 ```

@@ -82,7 +82,7 @@ df.show(4)

 As documented in the @ref:[function reference](reference.md), various user-defined functions implemented by GeoMesa are also available for use. The example below uses a GeoMesa user-defined function to compute the centroid of a geometry. It is logically equivalent to the example above, but more efficient.


-```python
+```python, native_centroid
 from pyrasterframes.rasterfunctions import st_centroid
 df = df.withColumn('centroid', st_centroid(df.geometry))
 df.select('name', 'geometry', 'naive_centroid', 'centroid').show(4)
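
The `add_centroid` and `native_centroid` chunks contrast a Python UDF with GeoMesa's built-in `st_centroid`. The UDF's body falls outside these hunks, so the following is only a plausible sketch inferred from the imports and the `naive_centroid` column name shown above, not the file's actual code:

```python
from pyspark.sql.functions import udf
from geomesa_pyspark.types import PointUDT

# Hypothetical body: the GeoMesa UDTs hand the column to Python as a Shapely
# geometry, so Shapely computes the centroid inside the UDF on the workers.
@udf(returnType=PointUDT())
def naive_centroid(geom):
    return geom.centroid

df = df.withColumn('naive_centroid', naive_centroid(df.geometry))
```

The built-in `st_centroid` in the last hunk avoids that Python round trip, which is why the page describes it as logically equivalent but more efficient.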
