Commit 25e7c4b

Merge pull request #299 from s22s/docs/time-series-improvements
Time series docs improvements
2 parents 17e6938 + f91193b · commit 25e7c4b

File tree: 4 files changed, +83 −31 lines
.circleci/config.yml

Lines changed: 5 additions & 0 deletions

@@ -75,11 +75,16 @@ jobs:
           command: |
             mkdir -p /tmp/core_dumps
             cp core.* *.hs /tmp/core_dumps 2> /dev/null || true
+            mkdir -p /tmp/markdown
+            cp /home/circleci/repo/pyrasterframes/target/python/docs/*.md /tmp/markdown 2> /dev/null || true
           when: on_fail

       - store_artifacts:
           path: /tmp/core_dumps

+      - store_artifacts:
+          path: /tmp/markdown
+
       - store_artifacts:
           path: docs/target/site
           destination: rf-site

pyrasterframes/src/main/python/docs/time-series.pymd

Lines changed: 75 additions & 30 deletions

@@ -4,9 +4,16 @@

 ```python setup, echo=False
 from IPython.display import display
+
 import pyrasterframes
 from pyrasterframes.rasterfunctions import *
 import pyrasterframes.rf_ipython
+
+import folium
+
+from pyspark.sql.functions import udf, lit
+from geomesa_pyspark.types import MultiPolygonUDT
+
 # This job is more memory bound, so reduce the concurrent tasks.
 spark = pyrasterframes.get_spark_session("local[4]")
 ```
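A note on the setup block above: the `local[4]` master string caps Spark at four concurrent local tasks, which is how this build keeps memory pressure down. For readers unfamiliar with `pyrasterframes.get_spark_session`, here is a rough stock-PySpark sketch of that one aspect, not part of the commit; the app name is hypothetical and the RasterFrames-specific session configuration that `get_spark_session` also performs is omitted.

```python
# Minimal sketch, assuming only stock PySpark: limit local execution to four
# concurrent tasks. pyrasterframes.get_spark_session('local[4]') does this and
# also applies RasterFrames-specific session configuration not shown here.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master('local[4]')            # at most four tasks run at once
         .appName('time-series-docs')   # hypothetical app name
         .getOrCreate())
```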
@@ -20,47 +27,86 @@ cat.printSchema()

 We will summarize the change in NDVI over 2018 in the Cuyahoga Valley National Park in Ohio, USA. First, we will retrieve open vector data delineating the park boundary from the US National Park Service's LandsNet.

+## Vector Data
+
+First we will get the vector data from LandsNet service by a REST query. The data is saved to a geojson file.
+
 ```python get_park_boundary
 import requests
-import geopandas
-nps_data_query_url = 'https://services1.arcgis.com/fBc8EJBxQRMcHlei/arcgis/rest/services/NPS_Park_Boundaries/FeatureServer/0/query?where=1%3D1&outFields=*&geometry=-82.451%2C41.075%2C-80.682%2C41.436&geometryType=esriGeometryEnvelope&inSR=4326&spatialRel=esriSpatialRelIntersects&outSR=4326&f=json'
+nps_filepath = '/tmp/parks.geojson'
+nps_data_query_url = 'https://services1.arcgis.com/fBc8EJBxQRMcHlei/arcgis/rest/services/' \
+                     'NPS_Park_Boundaries/FeatureServer/0/query' \
+                     '?geometry=-82.451,41.075,-80.682,41.436&inSR=4326&outSR=4326&f=geojson'
 r = requests.get(nps_data_query_url)
-with open('/tmp/parks.geojson', 'wb') as f:
-    for chunk in r.iter_content(chunk_size=128):
-        f.write(chunk)
+with open(nps_filepath,'wb') as f:
+    f.write(r.content)
+```

-park_df = geopandas.read_file('/tmp/parks.geojson')
-park_geo = park_df[park_df.UNIT_CODE=='CUVA'].geometry[0]
+```python, folium_map,
+m = folium.Map((41.25,-81.6), zoom_start=10).add_child(folium.GeoJson(nps_filepath))
+```

-park_geo.wkt[:100]
+```python, folium_persist, echo=False
+# this is the work around for ability to render the folium map in the docs build
+import base64
+temp_folium = 'docs/static/__cuya__.html'
+m.save(temp_folium)
+with open(temp_folium, 'rb') as f:
+    b64 = base64.b64encode(f.read())
+with open('docs/static/cuya.md', 'w') as md:
+    md.write('<iframe src="data:text/html;charset=utf-8;base64,{}" allowfullscreen="" webkitallowfullscreen="" mozallowfullscreen="" style="position:relative;width:100%;height:500"></iframe>'.format(b64.decode('utf-8')))
+# seems that the height is not correct?
 ```

+@@include[folium_static](static/cuya.md)
+
+Now we read the park boundary vector data as a Spark DataFrame using the built-in @ref:[geojson DataSource](vector-data.md#geojson-datasource). The geometry is very detailed, and the EO cells are relatively coarse. To speed up the processing, the geometry is "simplified" by combining vertices within about 100 meters of each other. For more on this see the section on Shapely support in @ref:[user defined functions](vector-data.md#shapely-geometry-support).

-The entire park boundary is contained in MODIS granule h11 v4. We will simply filter on this granule, rather than using a @ref:[spatial relation](vector-data.md#geomesa-functions-and-spatial-relations). The time period selected should show the change in plant vigor as leaves emerge over the spring and into early summer.
+```python read_cuya_vector
+park_vector = spark.read.geojson(nps_filepath)
+
+@udf(MultiPolygonUDT())
+def simplify(g, tol):
+    return g.simplify(tol)
+
+park_vector = park_vector.withColumn('geo_simp', simplify('geometry', lit(0.001))) \
+    .select('geo_simp') \
+    .hint('broadcast')
+```
+
+## Catalog Read
+
+The entire park boundary is contained in MODIS granule h11 v04. We will simply filter on this granule, rather than using a @ref:[spatial relation](vector-data.md#geomesa-functions-and-spatial-relations). The time period selected should show the change in plant vigor as leaves emerge over the spring and into early summer.

 ```python query_catalog
-from pyspark.sql.functions import lit
-park_cat = cat.filter(
+park_cat = cat \
+    .filter(
         (cat.granule_id == 'h11v04') &
         (cat.acquisition_date > lit('2018-02-19')) &
-        (cat.acquisition_date < lit('2018-07-01'))
-)
+        (cat.acquisition_date < lit('2018-07-01'))
+    ) \
+    .crossJoin(park_vector)
+
 park_cat.printSchema()
 ```

-## Catalog Read
-
 Now we have a catalog with several months of MODIS data for a single granule. However, the granule is larger than our park boundary. We will combine the park geometry with the catalog, and read only the bands of interest to compute NDVI, which we discussed in a @ref:[previous section](local-algebra.md#computing-ndvi).

+We then [reproject](https://gis.stackexchange.com/questions/247770/understanding-reprojection) the park geometry to the same @ref:[CRS](concepts.md#coordinate-reference-system--crs-) as the imagery. Then we will filter to only the _tiles_ intersecting the park.
+
 ```python read_catalog
 raster_cols = ['B01', 'B02',] # red and near-infrared respectively
 park_rf = spark.read.raster(
-    catalog=park_cat.select(['acquisition_date', 'granule_id'] + raster_cols),
-    catalog_col_names=raster_cols
-) \
-    .withColumn('park', st_geomFromWKT(lit(park_geo.wkt)))
+    catalog=park_cat.select(['acquisition_date', 'granule_id', 'geo_simp'] + raster_cols),
+    catalog_col_names=raster_cols) \
+    .withColumn('park_native', st_reproject('geo_simp', lit('EPSG:4326'), rf_crs('B01'))) \
+    .filter(st_intersects('park_native', rf_geometry('B01')))
+
 park_rf.printSchema()
-park_rf.persist()
+```
+
+```python persist_catalog, echo=False
+# park_rf.persist()
 ```

 ## Vector and Raster Data Interaction
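Two details in the hunk above are easy to miss. First, the `lit(0.001)` tolerance passed to the `simplify` UDF is in degrees (the GeoJSON is in EPSG:4326), which at the park's latitude of roughly 41°N works out to the "about 100 meters" mentioned in the prose. A standalone sanity check follows; it is not part of the commit and assumes only the `shapely` package.

```python
# Rough check (assumption: park latitude ~41.25 N) that a 0.001-degree
# simplification tolerance is on the order of 100 m.
import math
from shapely.geometry import LineString

meters_per_deg_lat = 110_574                                   # nearly constant
meters_per_deg_lon = 111_320 * math.cos(math.radians(41.25))   # shrinks with latitude
print(round(0.001 * meters_per_deg_lat))   # ~111 m north-south
print(round(0.001 * meters_per_deg_lon))   # ~84 m east-west

# Shapely's simplify() drops vertices that deviate from the overall shape by
# less than the tolerance; the simplify UDF above calls this same method on
# each deserialized geometry.
wiggly = LineString([(0, 0), (0.0004, 0.0002), (0.001, 0), (0.002, 0)])
print(list(wiggly.simplify(0.001).coords))  # intermediate vertices removed
```

Second, the `.hint('broadcast')` on the single-geometry `park_vector` keeps the subsequent `crossJoin` cheap: the simplified park boundary is shipped to every executor once rather than shuffled against the catalog rows.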
@@ -70,14 +116,12 @@ Now we have the vector representation of the park boundary alongside the _tiles_
 We do this using two transformations. The first one will reproject the park boundary from coordinates to the MODIS sinusoidal projection. The second one will create a new _tile_ aligned with the imagery containing a value of 1 where the pixels are contained within the park and NoData elsewhere.

 ```python burn_in
-cr_1 = park_rf.withColumn('park_native', st_reproject('park', lit('EPSG:4326'), rf_crs('B01')))
-
-cr_2 = cr_1 \
+rf_park_tile = park_rf \
     .withColumn('dims', rf_dimensions('B01')) \
     .withColumn('park_tile', rf_rasterize('park_native', rf_geometry('B01'), lit(1), 'dims.cols', 'dims.rows')) \
-    .where(rf_tile_sum('park_tile') > 0)
-cr_2.printSchema()
-cr_2.persist()
+    .persist()
+
+rf_park_tile.printSchema()
 ```

 ## Create Time Series
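For readers new to the rasterize-then-mask pattern in `burn_in`, here is a conceptual sketch in plain NumPy, not the RasterFrames implementation: the "burned in" park tile holds 1 where cells fall inside the boundary and NoData elsewhere, and masking NDVI against it keeps only in-park cells.

```python
# Conceptual sketch only (assumed 4x4 tile); rf_rasterize/rf_mask operate on
# RasterFrames tiles, but the per-cell logic is analogous.
import numpy as np

ndvi = np.random.uniform(-1, 1, size=(4, 4))        # stand-in for an NDVI tile
inside_park = np.array([[0, 0, 1, 1],
                        [0, 1, 1, 1],
                        [0, 1, 1, 0],
                        [0, 0, 0, 0]], dtype=bool)  # stand-in for the rasterized boundary

park_tile = np.where(inside_park, 1.0, np.nan)               # 1 inside, NoData (NaN) outside
ndvi_masked = np.where(np.isnan(park_tile), np.nan, ndvi)    # keep NDVI only inside the park

# These mirror rf_tile_sum and rf_data_cells used in the next hunk.
print(np.nansum(ndvi_masked), np.count_nonzero(~np.isnan(ndvi_masked)))
```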
@@ -88,22 +132,23 @@ Next, we will compute NDVI as the normalized difference of near infrared (band 2
 from pyspark.sql.functions import col, year, weekofyear, month
 from pyspark.sql.functions import sum as sql_sum

-rf_ndvi = cr_2.withColumn('ndvi', rf_normalized_difference('B02', 'B01')) \
-    .withColumn('ndvi_masked', rf_mask('ndvi', 'park_tile'))
+rf_ndvi = rf_park_tile \
+    .withColumn('ndvi', rf_normalized_difference('B02', 'B01')) \
+    .withColumn('ndvi_masked', rf_mask('ndvi', 'park_tile'))

 time_series = rf_ndvi \
     .withColumn('ndvi_wt', rf_tile_sum('ndvi_masked')) \
     .withColumn('wt', rf_data_cells('ndvi_masked')) \
     .groupby(year('acquisition_date').alias('year'), weekofyear('acquisition_date').alias('week')) \
     .agg(sql_sum('ndvi_wt').alias('ndvi_wt_wk'), sql_sum('wt').alias('wt_wk')) \
     .withColumn('ndvi', col('ndvi_wt_wk') / col('wt_wk'))
+
 time_series.printSchema()
-time_series.persist()
 ```

 Finally, we will take a look at the NDVI over time.

-```python time_series_display
+```python time_series_display, evaluate=True
 import matplotlib.pyplot as plt

 time_series_pdf = time_series.toPandas()
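The weekly aggregation above is an area-weighted mean: summing per-tile NDVI totals (`ndvi_wt`) and per-tile unmasked cell counts (`wt`) within each week, then dividing, gives the mean NDVI over every unmasked pixel observed that week rather than an unweighted average of tile means. A small pandas sketch of the same logic, not part of the commit and using labeled toy numbers:

```python
# Hedged illustration of the weighting logic (values below are invented for the example).
import pandas as pd

tiles = pd.DataFrame({
    'week':    [8, 8, 9],
    'ndvi_wt': [12.0, 3.0, 20.0],   # like rf_tile_sum('ndvi_masked') per tile
    'wt':      [100, 50, 80],       # like rf_data_cells('ndvi_masked') per tile
})

weekly = tiles.groupby('week').agg(ndvi_wt_wk=('ndvi_wt', 'sum'),
                                   wt_wk=('wt', 'sum'))
weekly['ndvi'] = weekly['ndvi_wt_wk'] / weekly['wt_wk']
print(weekly)   # week 8: 15/150 = 0.10, not the unweighted (0.12 + 0.06) / 2
```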

pyrasterframes/src/main/python/requirements.txt

Lines changed: 2 additions & 1 deletion

@@ -5,4 +5,5 @@ numpy>=1.7
 pandas>=0.25.0
 matplotlib<3.0.0 # no python 2.7 support after v2.x.x
 ipython==6.2.1
-rasterio>=1.0.0
+rasterio>=1.0.0
+folium # for documentation

pyrasterframes/src/main/python/setup.py

Lines changed: 1 addition & 0 deletions

@@ -167,6 +167,7 @@ def dest_file(self, src_file):
         'Pweave==0.30.3',
         'fiona==1.8.6',
         'rasterio>=1.0.0', # for docs
+        'folium',
     ],
     tests_require=[
         'pytest==3.4.2',
