Commit f14df7f

author
Philip (flip) Kromer
committed
improvements
1 parent 63b0208 commit f14df7f

6 files changed

+71
-52
lines changed

11b-spatial_aggregation-points.asciidoc

Lines changed: 31 additions & 20 deletions
Original file line number | Diff line number | Diff line change
@@ -6,26 +6,26 @@ We will start, as we always do, by applying patterns that turn Big Data into Muc
66
A great tool for visualizing a large spatial data set
77

88

9-
* You want to "wash out" everything but the spatial variation -- even though the data was gathered for each
10-
* Point measurement of effect with local extent -- for example, the temperature measured at a weather station is understood to be representative of the weather for several surrounding miles.
11-
*
12-
*
13-
* data reduction, especially for a heatmap visualization;
14-
* extracting a continuous measurement from a pointwise sample;
15-
* providing a common basis for comparison of multiple datasets;
16-
* smoothing out spatial variation;
17-
* for all the other reasons you aggregate groups of related values in context
18-
* You have sampled data at points in order to estimate something with spatial extent. The weather dataset is an example:
19-
* Data that manifests at a single point
20-
represents a process with
21-
For example, the number of airline passengers in and out of the major airport
22-
are travelling to and from local destinations
23-
* Smoothing pointwise data
24-
into a
25-
easier to compare or manage
26-
* continuous approximation
27-
represents just the variation due to spatial
28-
variables
9+
// * You want to "wash out" everything but the spatial variation -- even though the data was gathered for each
10+
// * Point measurement of effect with local extent -- for example, the temperature measured at a weather station is understood to be representative of the weather for several surrounding miles.
11+
// *
12+
// *
13+
// * data reduction, especially for a heatmap visualization;
14+
// * extracting a continuous measurement from a pointwise sample;
15+
// * providing a common basis for comparison of multiple datasets;
16+
// * smoothing out spatial variation;
17+
// * for all the other reasons you aggregate groups of related values in context
18+
// * You have sampled data at points in order to estimate something with spatial extent. The weather dataset is an example:
19+
// * Data that manifests at a single point
20+
// represents a process with
21+
// For example, the number of airline passengers in and out of the major airport
22+
// are travelling to and from local destinations
23+
// * Smoothing pointwise data
24+
// into a
25+
// easier to compare or manage
26+
// * continuous approximation
27+
// represents just the variation due to spatial
28+
// variables
2929

3030
The straightforward approach we'll take is to divide the world up into a grid of tiles and map the position of each point onto the unique grid tile it occupies. We can then group on each tile
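A minimal sketch of that idea in Python, not the book's Pig implementation: it assumes a plain longitude/latitude grid with square cells of `cell_deg` degrees, and the names `grid_tile` and `count_by_tile` are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

def grid_tile(lng, lat, cell_deg=1.0):
    """Return the (column, row) index of the grid cell containing a point.

    Assumes an equirectangular grid whose origin is at (-180, -90);
    coordinates are given longitude first, then latitude.
    """
    col = math.floor((lng + 180.0) / cell_deg)
    row = math.floor((lat + 90.0) / cell_deg)
    return (col, row)

def count_by_tile(points, cell_deg=1.0):
    """Group points on their tile and count how many fall in each one."""
    return Counter(grid_tile(lng, lat, cell_deg) for lng, lat in points)

# Two points near Austin share a tile; one near Helsinki gets its own.
points = [(-97.7, 30.3), (-97.2, 30.9), (24.9, 60.2)]
tile_counts = count_by_tile(points)
```

Any aggregate (mean, max, sum) can stand in for the count once the points are keyed by tile.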
3131

@@ -86,3 +86,14 @@ An epidemiologist or transportation analyst interested in knowing the large-scal
8686
* _Exercises_ --
8787
* _Important to Know_ --
8888
- A https://en.wikipedia.org/wiki/Dot_distribution_map[Dot Distribution Map] is in some sense the counterpart to a spatial average -- turning data over a region into data at synthesized points
89+
90+
91+
=== Matching Points within a Given Distance (Pointwise Spatial Join)
92+
93+
94+
95+
==== Distance isn't as it seems
96+
97+
The picture below shows a circle, 350 km in radius, centered at 60 degrees latitude (up near Helsinki). What you should see is that the lines of constant longitude "come together" faster than the curve of the circle does. This is most
98+
99+
image::images/11-circle_of_constant_distance.png[Min/Max Longitudes are not at the same latitude as the center]
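To put rough numbers on that picture, here is a small sketch in Python. It assumes a spherical Earth of radius 6371 km and a circle that does not enclose a pole; the function name `circle_lng_extent` is hypothetical. It compares the naive longitude half-width (distance divided by the length of a degree at the center's latitude) with the extent given by spherical trigonometry, and reports the latitude at which the extreme longitude is reached.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth

def circle_lng_extent(lat_deg, radius_km):
    """Longitude half-width of a circle of constant distance on a sphere.

    Returns (naive_deg, true_deg, tangent_lat_deg):
      naive_deg       -- half-width measured along the center's own latitude line
      true_deg        -- actual maximum longitude offset reached by the circle
      tangent_lat_deg -- latitude at which that maximum offset occurs
    Only valid when the circle does not contain a pole.
    """
    lat = math.radians(lat_deg)
    ang = radius_km / EARTH_RADIUS_KM          # angular radius of the circle
    naive = ang / math.cos(lat)
    true = math.asin(math.sin(ang) / math.cos(lat))
    tangent_lat = math.asin(math.sin(lat) / math.cos(ang))
    return (math.degrees(naive), math.degrees(true), math.degrees(tangent_lat))

naive_deg, true_deg, tangent_lat_deg = circle_lng_extent(60.0, 350.0)
```

For the 350 km circle at 60 degrees the two widths differ only slightly, but the extreme longitudes sit a little poleward of the center, which matters when you build a bounding box for a distance search.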

11c-geospatial_mechanics.asciidoc

Lines changed: 11 additions & 10 deletions
@@ -1,19 +1,20 @@
11
=== Mechanics of Spatial Data
22

3-
We kicked off the chapter with an interesting example that didn't require too many new concepts, but it's time to backtrack a bit and properly cover the mechanics of working with spatial data.
3+
We kicked off the chapter with two examples that didn't require too many new concepts, but it's time to backtrack a bit and properly cover the mechanics of working with spatial data.
44

55
The data types and operations are extremely well standardized by the http://www.opengeospatial.org/standards/sfa[Open Geospatial Consortium]. Nearly all of the operations below have identical behavior within Oracle, PostGIS, SQL Server, and all industrial-strength geospatial systems. In fact, the geospatial toolkits for Pig (Pigeon) and Hive (Esri-SFFH) are particularly simpatico as they both use Esri's wonderful https://github.com/Esri/geometry-api-java[Esri Geometry API] under the hood.
66

77
==== Spatial Data Types
88

9-
* `Point` --
10-
* `LineString` --
11-
* `Polygon` --
12-
* `MultiPoint` --
13-
* `MultiLineString` -- Although a `Polygon` also has multiple chains of coordinates, a `Polygon` is not a `MultiLineString`. Most importantly, a `Polygon` represent a 2-D shape, and the coordinates delimit the continuous dense set of points in its interior; a `MultiLineString` represents a 1-D shape, and the coordinates delimit the endpoints of its line segments. What's more, the line strings defining a polygon must be 'rings': closed paths that do not cross or touch; the elements of a `LineString` or `MultiLineString` are allowed to be either open or closed and may cross or touch.
14-
* `MultiPolygon` -- you guessed it, a collection of polygons. They're allowed to overlap, lie within each other, or anything else they want to do.
9+
* `Point` -- a single location in space, given by its horizontal, then vertical coordinates. That's an easy convention to swallow when you think in terms of `x`, `y` -- but it also means you should always list coordinates in the order longitude first, then latitude. Get in the habit of always using that ordering.
10+
* `LineString` -- a single continuous path, described as an ordered sequence of points. To describe a closed path, repeat the line's start point as its end point. A path is 'simple' if it does not cross or touch itself; a path is a 'ring' if it is both simple and closed.
11+
* `Polygon` -- a connected surface in space, described by one closed simple path defining its exterior, and zero, one, or many non-crossing rings defining any interior holes. The exterior ring is always listed first, and no ring is permitted to cross or touch itself or any other ring.
12+
* `MultiPoint` -- a collection of points regarded as a single shape.
13+
* `MultiLineString` -- a collection of lines regarded as a single shape. Although a `Polygon` also has multiple chains of coordinates, a `Polygon` is not a `MultiLineString`. Most importantly, a `Polygon` represents a 2-D shape with an interior; a `MultiLineString` represents a collection of 1-D shapes. What's more, the line strings defining a polygon must be non-intersecting rings, while the elements of a `LineString` or `MultiLineString` are permitted to be either open or closed, and may cross or touch.
14+
* `MultiPolygon` -- you guessed it, a collection of polygons regarded as a single shape. These polygons are allowed to overlap, lie within each other, or anything else they want to do.
15+
* `Envelope` -- an axis-aligned rectangle depicting the minimum and maximum extent of a shape in each coordinate. Since its sides are aligned with the axes, we only have to give the coordinates of two of its corners. From the perspective of the geometry libraries this does not live in the same type hierarchy as the geometry objects above, but it's easy enough to generate the polygon corresponding to an envelope or the envelope of any shape. Any time you're specifying a bounding box, follow the convention of numerically-lowest-coordinates then numerically-highest-coordinates, i.e. `( (min_x, min_y), (max_x, max_y) )`. Like the longitude-then-latitude convention, it's violated just often enough to drive you crazy.
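As an illustration of the two conventions above (longitude then latitude; minimum corner then maximum corner), here is a short Python sketch. It uses plain tuples rather than any geometry library, and the helper names `envelope` and `envelope_to_polygon` are hypothetical.

```python
def envelope(points):
    """Bounding box of (x, y) points as ((min_x, min_y), (max_x, max_y))."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return ((min(xs), min(ys)), (max(xs), max(ys)))

def envelope_to_polygon(env):
    """Exterior ring of the rectangle for an envelope.

    The ring is closed by repeating its start point as its end point.
    """
    (min_x, min_y), (max_x, max_y) = env
    return [
        (min_x, min_y),
        (max_x, min_y),
        (max_x, max_y),
        (min_x, max_y),
        (min_x, min_y),
    ]

# Coordinates are (longitude, latitude), in that order.
path = [(-97.7, 30.3), (-95.4, 29.8), (-96.8, 32.8)]
bbox = envelope(path)
ring = envelope_to_polygon(bbox)
```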
1516

16-
Those are the essential data types used by the underlying geospatial methods. However, when adapting geospatial methods to Pig there are really only two families of shapes to consider:
17+
Those are the essential data types used by geospatial libraries everywhere. However, when adapting geospatial methods to Pig there are really only two families of shapes to consider:
1718

1819
* Points, which lack spatial extent
1920
* Regions (i.e. all geometries that are not of type `Point`), which span more than one location in space
@@ -78,7 +79,7 @@ Somewhat related are operations that bring shapes in and out of Pig's control.
7879
// * Curve, Surface, MultiCurve, MultiSurface, GeomCollection, Geometry
7980
// * M, Z / MaxZ / MaxM / MinM / MinZ
8081

81-
==== Simple Properties of Shapes
82+
==== Properties of Shapes
8283

8384
* `GeoArea(geom)`
8485
* `MinX(geom)`, `MinY(geom)`, `MaxX(geom)`, `MaxY(geom)` -- the numerically least and greatest extent of a shape in the specified dimension.
@@ -100,7 +101,7 @@ There are also a set of meta-operations that report on the geometry objects repr
100101
* `GeometryType(geom)` -- string representing the geometry type: `'Point'`, `'LineString'`, ..., `'MultiPolygon'`.
101102
* `IsGeomEmpty(geom)` -- 1 if the geometry contains no actual points.
102103
* `IsLineClosed(line)` -- 1 if the given `LineString`'s end point meets its start point.
103-
* `IsSimple` -- 1 if the geometry has no anomalous geometric aspects, such intersecting or being tangent to itself. (TODO: tasteful joke goes here.)
104+
* `IsSimple` -- 1 if the geometry has no anomalous geometric aspects, such as intersecting or being tangent to itself. A multipoint is 'simple' if none of its points coincide.
104105
* `IsLineRing` -- 1 if the given `LineString` is a ring -- that is, closed and simple.
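Two of these checks are simple enough to sketch directly. The Python below works on plain coordinate tuples and is only an illustration of the definitions given above, not the code behind the UDFs; the function names are hypothetical. A full simplicity test for a line would also need segment-intersection checks, which are left out here.

```python
def is_line_closed(coords):
    """True if a line's end point meets its start point."""
    return len(coords) >= 2 and coords[0] == coords[-1]

def is_multipoint_simple(points):
    """True if no two points in the collection coincide."""
    return len(set(points)) == len(points)

open_path = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
closed_path = open_path + [(0.0, 0.0)]
```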
105106

106107
* `NumGeometries(geom_collection)`

11c-spatial_aggregations_on_regions.asciidoc

Lines changed: 26 additions & 19 deletions
@@ -4,8 +4,7 @@ While spatially aggregating pointwise data required nothing terribly sophisticat
44

55
Fair warning: though the first version of this script we demonstrate for you is correct as it stands, it will be exceptionally inefficient. But we want to demonstrate the problem and then discuss its elegant solution.
66

7-
X points Y gridcells occupied ~ 6000 grid cells (50 x 120)
8-
7+
// X points Y gridcells occupied ~ 6000 grid cells (50 x 120)
98

109
The key first step is to assemble a table giving the weight of each region's contribution to the tile.
1110

@@ -49,53 +48,61 @@ Joining that to the 2.3 million rows of crop production data produces big data f
4948
5049
// If what follows doesn't set your heart singing, you might have chosen the wrong profession.
5150
52-
Our solution will follow the same logic as the solution JT learned from the friendly pianist.
53-
Instead of breaking down
51+
// Our solution will follow the same logic as the solution JT learned from the friendly pianist.
5452
55-
image::images/images/11-f-quad_decompositions/spatial_manor-rooms_peeps_grid.png[Spatial Manor - Rooms (and underlying ZL-3 grid)]
53+
Instead of breaking down
5654
57-
image::images/images/11-f-quad_decompositions/spatial_manor-quadkeys.png[Spatial Manor - Decomposed]
55+
image::images/11-f-quad_decompositions/11-quaddecomp-world-mercator.png[Variable Level of Detail prevents Inefficient Tiny Tiles]
5856
5957
The QuadDecompose UDF accepts an outer and an inner zoom level.
6058
Every region will be broken down to at least the coarser zoom level; you'll use this grid size as the partition key (in map/reduce) or group key (in Pig).
6159
No region will be broken into tiles smaller than the inner zoom level, which is important for both managing the data volume and for ensuring we don't group data into finer bins than it can support (see the section on "Choosing a Histogram Bin Size" (REF)).
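The two-zoom-level rule can be sketched in a few lines of Python. This is not the QuadDecompose UDF itself: it assumes the region is an axis-aligned rectangle inside the unit square, and it assumes a quadkey numbering in which each digit's low bit selects the x half and its high bit selects the y half. The name `quad_decompose` and its parameters are hypothetical.

```python
def quad_decompose(region, coarse_zl, fine_zl):
    """Decompose a rectangle into quadtree tiles, returned as quadkey strings.

    region    -- (min_x, min_y, max_x, max_y) within the unit square
    coarse_zl -- every emitted tile is at least this zoom level
    fine_zl   -- no emitted tile is finer than this zoom level
    """
    rx0, ry0, rx1, ry1 = region
    tiles = []

    def visit(quadkey, x0, y0, size):
        zl = len(quadkey)              # zoom level = number of quadkey digits
        x1, y1 = x0 + size, y0 + size
        # Skip tiles that do not overlap the region at all.
        if x1 <= rx0 or x0 >= rx1 or y1 <= ry0 or y0 >= ry1:
            return
        inside = rx0 <= x0 and x1 <= rx1 and ry0 <= y0 and y1 <= ry1
        # Stop at the finest level, or as soon as a tile lies wholly
        # inside the region and is at least as fine as the coarse level.
        if zl >= fine_zl or (inside and zl >= coarse_zl):
            tiles.append(quadkey)
            return
        half = size / 2
        for digit in range(4):
            dx, dy = digit & 1, digit >> 1
            visit(quadkey + str(digit), x0 + dx * half, y0 + dy * half, half)

    visit("", 0.0, 0.0, 1.0)
    return tiles

# The interior is covered by one zoom-level 1 tile; only the ragged
# right-hand edge is broken down to zoom level 3.
tiles = quad_decompose((0.0, 0.0, 0.625, 0.5), coarse_zl=1, fine_zl=3)
```

Note that this sketch emits boundary tiles at the finest level even when they are only partly covered; weighting or clipping those tiles is a separate step.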
6260
6361
In the graphic (REF), you can see the result:
6462
the top two images, showing the tiles decomposed naively (on the left) and hierarchically (on the right), are identical. The lower row has the same data but with a light border around each tile in the heatmap. The large interior portions of Russia, China, the USA, Brazil and others are now represented by coarser zoom-level blocks. Even France and Argentina manage to line up their borders fortuitously for a zoom-level 6 (TODO check size) block. In all, we reduced the number of tiles from XXX down to XXX without impacting accuracy.
65-
63+
//
6664
TODO: move to join part below.
6765
6866
Since every point on the map is covered by at most one region for the current data set, you'll find that every tile is either (a) wholly in the interior of a country, or (b) at the finest zoom level: there's no way for Russia to send data off to a huge zoom-level 5 tile while Latvia sends data to one of its zoom-level 7 children.
6967
The zoom-level chosen for each tile in the `FOREACH...QuadDecompose` pass was the correct granularity for the result.
68+
7069
// TODO: reword
7170
71+
=== Quadtree Decomposition and Numbering
7272
73+
Let's look closer at the quadtree scheme.
7374
74-
=== Projections and Tiling Schemes
75+
Our Reindeer friends, with their ample free time and bizarre sense of humor, sometimes vacation at a quaint estate known as Spatial Manor. On the one hand, every once in a while somebody attacks a fellow guest with a candlestick or rope or whatever's close at hand. On the other hand, though, most people _aren't_ killed and get to enjoy the rousing sport of solving a mystery!
7576
77+
We can apply our spatial analysis tools to its simple geometry and help the investigators make a data-backed decision. Here is a map of the grounds.
7678
77-
* Equal-area:
78-
- features uniformly distributed on the globe will be uniformly distributed among grid cells.
79-
* Platte-Careé (Equirectangular)
80-
- Extremely simple to compute
81-
- Plot directly into screen coordinates with
82-
-
79+
image::images/images/11-f-quad_decompositions/spatial_manor-rooms_peeps_grid.png[Spatial Manor - Rooms (and underlying ZL-3 grid)]
80+
81+
image::images/images/11-f-quad_decompositions/spatial_manor-quadkeys.png[Spatial Manor - Decomposed]
82+
83+
84+
85+
// === Projections and Tiling Schemes
86+
//
87+
//
88+
// * Equal-area:
89+
// - features uniformly distributed on the globe will be uniformly distributed among grid cells.
90+
// * Plate Carrée (Equirectangular)
91+
// - Extremely simple to compute
92+
// - Plot directly into screen coordinates with
93+
// -
8394
8495
==== Exporting data for Presentation by a Tileserver
8596
86-
The most commonly
97+
// The most commonly
8798
8899
Features following a constant bearing in any direction -- Manhattan's Broadway, or the borderlines of Algeria or Nevada -- remain straight lines on the map. This is important for navigational purposes
89100
90101
The locality properties of quadtiles indexing
91102
92-
93-
94103
Typically you will store the shape clipped to the given quad-tile.
95104
Metadata about that region -- population, metric tons of bananas exported annually, an image of its flag, lyrics to its national anthem -- is stored under the same key but in independent columns footnote:[typically the regions are heavyweight, heavily requested and read-only, so they deserve their own table or at least their own column family.]
96105
97-
98-
99106
You don't have to break
100107
What we do is
101108
When tile 0231_1 is requested

images/Quadtiles-ClueRegions.graffle

Lines changed: 3 additions & 3 deletions
@@ -18075,7 +18075,7 @@
1807518075
<key>Print</key>
1807618076
<string>YES</string>
1807718077
<key>View</key>
18078-
<string>NO</string>
18078+
<string>YES</string>
1807918079
</dict>
1808018080
<dict>
1808118081
<key>Lock</key>
@@ -18160,7 +18160,7 @@
1816018160
<key>MasterSheets</key>
1816118161
<array/>
1816218162
<key>ModificationDate</key>
18163-
<string>2014-07-30 06:39:19 +0000</string>
18163+
<string>2014-07-31 18:49:24 +0000</string>
1816418164
<key>Modifier</key>
1816518165
<string>Philip flip Kromer</string>
1816618166
<key>NotesVisible</key>
@@ -18258,7 +18258,7 @@
1825818258
<key>SidebarWidth</key>
1825918259
<integer>120</integer>
1826018260
<key>VisibleRegion</key>
18261-
<string>{{-363, 2}, {1331, 1049}}</string>
18261+
<string>{{-363, 1}, {1331, 1049}}</string>
1826218262
<key>Zoom</key>
1826318263
<real>1</real>
1826418264
<key>ZoomValues</key>
