11-geographic.asciidoc: 270 additions & 3 deletions
@@ -51,6 +51,273 @@ powerful scripts that give actionable insight.
51
51
52
52
53
53
54
+
* Mapping from earth to grid
55
+
* Extending the base of analytic patterns to two and more dimensions
56
+
57
+
A longitude/latitude pair implies an elevation as well. For the great majority of questions of interest, elevation is irrelevant, and so we discard it.
58
+
We just need to rough-cut the data and then turn it over to dedicated spatial methods.
59
+
60
+
61
+
Our approach will consistently be to apply the toolkit of Analytic Patterns that we assembled in Chapters 4-9 (REF) to put all relevant data into context, and then turn to specialized geospatial libraries to synthesize results.
66
+
67
+
The large-scale part of this demands no great sophistication. We can pretend that circles are rectangular, that shapes never have holes, that the earth is not only a perfect sphere but in fact a planar grid. All manner of convenient distortions, so outrageous they literally tear the space-time continuum, are allowable as long as they obey the fundamental strategic rule:
71
+
72
+
* put all data that might form relevant context together
73
+
74
+
(TODO: better phrasing)
75
+
76
+
=== Points, Paths and Regions
77
+
78
+
79
+
80
+
81
+
==== Geometry Primitives: Points, Polygons and so forth
82
+
83
+
Back in Chapter 4 (REF), we introduced the simple scalar types (numbers, strings, etc.) and three complex types (`tuple`, `bag`, and `map`). Since every spatial analysis exploration involves
84
+
85
+
One thing
86
+
Spatial analysis libraries
87
+
rely on the http://www.opengeospatial.org/[OGC (Open Geospatial Consortium)]
88
+
89
+
Geometry
90
+
91
+
Point, LineString, Polygon; and corresponding multi-part geometries MultiPoint, MultiLineString, MultiPolygon.
92
+
93
+
Behind these smiling, friendly, inviting abstractions lies a host of diabolical complexities.
96
+
97
+
In regular usage, even double-precision floating-point math can introduce discrepancies large enough to invalidate results or present visual artifacts: pushing the boundary of a shape off the shape itself, causing tears or overlaps where there were none, turning small polygons into degenerate points, or introducing numerical instability.
101
+
102
+
But for the big-data part of the job, where we are chiefly concerned with relating data in context, there are really only these:
104
+
105
+
* a point in space
106
+
* a spatial extent -- paths, regions, etc
107
+
// * non-spatial data
108
+
109
+
In fact, we can go even farther:
110
+
111
+
* points
112
+
* rectangles
113
+
114
+
Remember, all we're trying to do is land all (possibly) related data onto the same reducer before we bring in the big guns.
115
+
116
+
117
+
118
+
==== Smoothing Pointwise Data Locally (Spatial Aggregation of Points)
119
+
120
+
121
+
122
+
* You want to "wash out" everything but the spatial variation -- even though the data was gathered for each
123
+
* Point measurement of effect with local extent -- for example, the temperature measured at a weather station is understood to be representative of the weather for several surrounding miles.
124
+
*
125
+
*
126
+
127
+
128
+
129
+
130
+
===== Pattern Recap: Spatial Aggregation of Points
131
+
132
+
133
+
*
134
+
* _Generic Example_ -- mmm
135
+
* _When You'll Use It_ -- as mentioned above:
136
+
* data reduction, especially for a heatmap visualization;
137
+
* extracting a continuous measurement from a pointwise sample;
138
+
* providing a common basis for comparison of multiple datasets;
139
+
* smoothing out spatial variation;
140
+
* for all the other reasons you aggregate groups of related values in context
141
+
* _Exercises_ --
142
+
143
+
144
+
145
+
146
+
147
+
148
+
==== Smoothing Regional Data onto a Consistent Grid (Spatial Aggregation of Regions)
* Common sense tells you that a weather observation is generally valid for places within a few kilometers, but certainly not useful for places hundreds of kilometers away. It would be useful to have a more precise guideline for the distance beyond which a weather measurement should no longer be considered reliable.
166
+
* First, find all pairs of weather stations within 50 km of each other. Emit each pair of IDs with the lower-numbered ID in the first slot (making it easy to ensure uniqueness).
* For each such pair, take a year of weather observations and determine the difference in temperature measurements taken at the same hour.
* Use a HashMap (replicated) join of the station-station pairs on the observations table. You could also do a total sort of the pairs table and use a merge-join if you're memory constrained.
* Join the resulting table back onto the observations table.
170
+
* (In this case, most weather stations are a part of at least one pair, and so most of the rows in the observations table are retained. If that weren't the case,
171
+
*
172
+
173
+
* As the radius expands, you'll quickly find that the amount of data begins to explode, so restrict that upper radius band initially.
174
+
* (is this also a problem: "You might have also noticed another problem. Even apart from a distance effect, with more neighbors there are more opportunities for observations to disagree.")
175
+
*
176
+
* (you should know that the answer has some bias -- places with a large concentration of weather stations are typically heavily populated, and heavily populated places don't tend to have extreme weather. We're just looking for a good rule-of-thumb though)
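The first step of the exercise above (finding station pairs within 50 km and emitting the lower-numbered ID first) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the haversine formula and the 6371 km mean earth radius are our choices for the distance measure, the station IDs and coordinates are made up, and the brute-force pairing would be replaced by a tiled join at scale.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # assumed mean earth radius


def haversine_km(lng1, lat1, lng2, lat2):
    """Great-circle distance in km between two (lng, lat) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby_station_pairs(stations, max_km=50.0):
    """stations: dict of id -> (lng, lat). Returns a list of (lower_id, higher_id, km)."""
    pairs = []
    # Sorting by ID first guarantees the lower-numbered ID lands in the first slot,
    # so each pair is emitted exactly once.
    for (id_a, pos_a), (id_b, pos_b) in combinations(sorted(stations.items()), 2):
        dist = haversine_km(*pos_a, *pos_b)
        if dist <= max_km:
            pairs.append((id_a, id_b, dist))
    return pairs


# Hypothetical station IDs with approximate coordinates (two near Austin, one near Houston).
stations = {3: (-97.74, 30.27), 1: (-97.67, 30.19), 2: (-95.37, 29.76)}
pairs = nearby_station_pairs(stations)
```

Only the two Austin-area stations pair up here; the Houston station is a couple of hundred kilometers away and so drops out.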
177
+
178
+
179
+
What makes a good exemplar?
180
+
* Head-of-the-tail --
181
+
* extreme specimens will pop on their own. You want to see what's happening to the
182
+
* Ones that are unusual without being weird. The solstice is
183
+
* Essential troublemakers: leap years, the centennial leap-year-exceptions, and the quad-centennial leap-year-exception-exceptions.
184
+
* Well represented
185
+
* it's no fun if your exemplars disappear mid-journey -- most commonly because they failed to find a match during a join.
186
+
* Chosen by out-of-band criteria -- deciding to look for "this date three years ago" and then finding a record is better than choosing the first record you see -- that particular record may have been the first one you saw because it is unrepresentative in some way.
187
+
* just as a magician will pull back their shirtsleeves to show they have no rabbit concealed within, this keeps you from fooling yourself. http://en.wikipedia.org/wiki/Nothing_up_my_sleeve_number
188
+
* (in fact, cryptographers have a concept of a "nothing-up-my-sleeve" number: when a large arbitrary collection of numbers is needed, choosing the first twenty-five digits of Pi is believably arbitrary, whereas choosing the 387th through 412th digits raises the specter of a purposeful "backdoor").
189
+
190
+
191
+
192
+
We will start, as we always do, by applying patterns that turn Big Data into Much Less Data. In particular,
193
+
A great tool for visualizing a large spatial data set
194
+
195
+
196
+
197
+
==== Smoothing Pointwise Data Locally (Spatial Aggregations)
198
+
199
+
200
+
201
+
There are a great many occasions where it's useful to translate
202
+
203
+
* You have sampled data at points in order to estimate something with spatial extent. The weather dataset is an example:
204
+
205
+
* Data that manifests at a single point
206
+
represents a process with
207
+
For example, the number of airline passengers in and out of the major airport
208
+
are travelling to and from local destinations
209
+
210
+
* Smoothing pointwise data
211
+
into a
212
+
easier to compare or manage
213
+
214
+
* continuous approximation
215
+
represents just the variation due to spatial
216
+
variables
217
+
218
+
219
+
The straightforward approach we'll take is to divide the world up into a grid of tiles and map the position of each point onto the unique grid tile it occupies. We can then group on each tile
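A minimal Python sketch of the tile-and-group idea just described. The names `tile_of` and `mean_by_tile`, the one-degree tile size, and the sample observations are all illustrative assumptions. It uses plain longitude/latitude cells, which are not equal-area; the text leaves the choice of grid open.

```python
import math
from collections import defaultdict


def tile_of(lng, lat, tile_deg=1.0):
    """Return the (column, row) index of the grid tile containing a point."""
    # Shift so indices start at zero at the south-west corner of the map.
    return (math.floor((lng + 180.0) / tile_deg), math.floor((lat + 90.0) / tile_deg))


def mean_by_tile(observations, tile_deg=1.0):
    """observations: iterable of (lng, lat, value). Returns {tile: mean of values}."""
    groups = defaultdict(list)
    for lng, lat, value in observations:
        groups[tile_of(lng, lat, tile_deg)].append(value)
    return {tile: sum(vals) / len(vals) for tile, vals in groups.items()}


# Made-up temperature readings: two fall in the same one-degree tile, one elsewhere.
observations = [(-97.74, 30.27, 30.0), (-97.10, 30.90, 32.0), (-95.37, 29.76, 28.0)]
tile_means = mean_by_tile(observations)
```

In a Hadoop job the tile index would serve as the group key, so every point in a tile lands on the same reducer.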
220
+
221
+
// TODO-qem: do we use just plain x / y coordinates?
222
+
223
+
footnote:[Instead of the ]
224
+
225
+
226
+
Area of a spherical segment is 2*pi*R*h --
227
+
so for lat from equator to 60
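To make the spherical-segment formula concrete, here is a small Python check. It assumes a perfectly spherical earth of radius 6371 km and uses the fact that a band between two latitudes has height h = R (sin(lat_hi) - sin(lat_lo)); the function name `zone_area` is ours.

```python
import math


def zone_area(lat_lo_deg, lat_hi_deg, radius=6371.0):
    """Surface area (km^2) of the band of a sphere between two latitudes."""
    # Height of the band measured along the polar axis.
    h = radius * (math.sin(math.radians(lat_hi_deg)) - math.sin(math.radians(lat_lo_deg)))
    return 2 * math.pi * radius * h


hemisphere = zone_area(0, 90)
fraction = zone_area(0, 60) / hemisphere  # equals sin(60 degrees), roughly 0.866
```

Under these assumptions, the band from the equator to 60 degrees holds roughly 87% of a hemisphere's area, even though it spans only two-thirds of its latitude range. That is why equal-angle tiles shrink so noticeably toward the poles.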
228
+
229
+
230
+
===== Pattern in Use
231
+
232
+
233
+
* _Further Reading_:
234
+
- A https://en.wikipedia.org/wiki/Dot_distribution_map[Dot Distribution Map] is in some sense the counterpart to a spatial average.
235
+
236
+
237
+
==== Exporting data for Presentation by a Tileserver
238
+
239
+
==== Finding the Centroid of an Extent
240
+
241
+
242
+
243
+
==== Finding the Bounding Box of an Extent
244
+
245
+
246
+
==== Finding the Bounding Box of Points Within a Radius
247
+
248
+
249
+
250
+
251
+
252
+
* _choose exemplars_:
253
+
- Midway, because it's large; Austin, because it's one of our exemplar cities; and (TODO something tiny) because it's very small.
254
+
- the sightings X, y, which each have a fun description and are near multiple airports; and Z, which is not near an airport.
255
+
- weather observations:
256
+
- a date with a new moon and a full moon; 8/8/08, because auspicious; an equinox and a solstice
257
+
-
258
+
259
+
260
+
==== Combining Regions with Set Operations
261
+
262
+
(intersection, union, diff, xor)
263
+
264
+
265
+
==== Testing the Relationship of two Regions
266
+
267
+
DE-9IM
268
+
269
+
equals
270
+
disjoint
271
+
touches
272
+
contains
273
+
covers
274
+
275
+
intersects,
276
+
within
277
+
covered_by
278
+
279
+
crosses
280
+
overlaps
281
+
282
+
From Wikipedia:
283
+
284
+
Equals: a = b that is (a ∩ b = a) ∧ (a ∩ b = b)
285
+
Within: a ∩ b = a
286
+
Intersects: a ∩ b ≠ ∅
287
+
Touches: (a ∩ b ≠ ∅) ∧ (aο ∩ bο = ∅)
288
+
289
+
* point/point: Equals, Disjoint. Other valid predicates collapse into Equals.
* point/line: adds Intersects. Intersects is a flexibilization of Equals, "some equal point at the line".
* line/line: adds Touches, Crosses, ... Touches is a constraint of Intersects, about "only boundaries"; Crosses is about "only one point".
292
+
293
+
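Each cell of a DE-9IM pattern takes a value from {0,1,2,T,F,*}: the dimensions 0, 1, 2; T (any non-empty intersection) / F (empty intersection); and * (don't care).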
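As an illustration of how the pattern codes are used, the sketch below matches a nine-character DE-9IM matrix string against a pattern. The function name `de9im_matches` is hypothetical. The two patterns shown are the standard ones for Within and Disjoint, and the sample matrices are the ones two polygons would typically produce in those situations.

```python
def de9im_matches(matrix, pattern):
    """Return True if a 9-character DE-9IM matrix satisfies a 9-character pattern."""
    if len(matrix) != 9 or len(pattern) != 9:
        raise ValueError("matrix and pattern must each have exactly 9 cells")
    for cell, want in zip(matrix, pattern):
        if want == "*":
            continue  # don't care
        if want == "T":
            if cell not in ("0", "1", "2"):
                return False  # T requires a non-empty intersection of any dimension
        elif cell != want:
            return False  # 'F', '0', '1', '2' must match exactly
    return True


WITHIN = "T*F**F***"
DISJOINT = "FF*FF****"

far_apart = "FF2FF1212"  # two polygons that share no points
nested = "2FF1FF212"     # a polygon lying strictly inside another
```

Spatial libraries usually expose the named predicates directly; the pattern form is mainly useful when you need a relationship that has no name.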
294
+
295
+
296
+
297
+
=== Key Strategic Pattern: Tile / Cull / Process
298
+
299
+
300
+
* _Tile_ -- tile the grid
301
+
* _Cull_ -- eliminate
302
+
* _Process_ --
303
+
304
+
=== Matching Points in a Table with Nearby Points in Another (Spatial Join)
305
+
306
+
307
+
* scatter points to nine tiles
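A hedged sketch of the nine-tile scatter, in Python rather than a Hadoop job. It assumes planar x/y coordinates and a tile edge equal to the search radius, so a match can be at most one tile away in each direction. All function names and sample points are illustrative. Points from one table stay in their own tile; points from the other are copied to their tile and its eight neighbors, so every candidate pair meets in exactly one bucket.

```python
import math
from collections import defaultdict


def tile_of(x, y, tile_size):
    """Index of the square tile containing a planar point."""
    return (math.floor(x / tile_size), math.floor(y / tile_size))


def nine_tiles(x, y, tile_size):
    """The point's own tile plus its eight neighbors."""
    tx, ty = tile_of(x, y, tile_size)
    return [(tx + dx, ty + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]


def spatial_join(left, right, radius):
    """left, right: lists of (id, x, y). Returns sorted (left_id, right_id) pairs within radius."""
    tile_size = radius  # assumption: tile edge equals the search radius
    buckets = defaultdict(list)  # stands in for the reducer grouping
    for rid, x, y in right:
        for tile in nine_tiles(x, y, tile_size):
            buckets[tile].append((rid, x, y))
    matches = []
    for lid, lx, ly in left:
        # Each left point visits only its own tile, so no pair is reported twice.
        for rid, rx, ry in buckets.get(tile_of(lx, ly, tile_size), []):
            if math.hypot(lx - rx, ly - ry) <= radius:
                matches.append((lid, rid))
    return sorted(matches)


left = [("a", 0.9, 0.9), ("b", 5.0, 5.0)]
right = [("p", 1.1, 1.1), ("q", 3.0, 3.0)]
joined = spatial_join(left, right, radius=1.0)
```

Points `a` and `p` sit in different tiles yet are still matched, which is the whole reason for the scatter.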
308
+
309
+
310
+
=== Matching Points with the Regions
311
+
312
+
313
+
314
+
315
+
316
+
317
+
318
+
319
+
320
+
54
321
=== Mechanics of Geographic Data
55
322
56
323
==== Longitude and Latitude, Points and Features
@@ -64,12 +331,12 @@ powerful scripts that give actionable insight.
- https://github.com/Esri/geometry-api-java -- The Esri Geometry API for Java enables developers to write custom applications for analysis of spatial data. This API is used in the Esri GIS Tools for Hadoop and other 3rd-party data processing solutions.