improvements

Philip (flip) Kromer · Philip (flip) Kromer · commit f14df7fb6942 · 2014-07-31T14:26:43.000-05:00
diff --git a/11b-spatial_aggregation-points.asciidoc b/11b-spatial_aggregation-points.asciidoc
@@ -6,26 +6,26 @@ We will start, as we always do, by applying patterns that turn Big Data into Muc
 A great tool for visualizing a large spatial data set
 
 
-* You want to "wash out" everything but the spatial variation -- even though the data was gathered for each
-* Point measurement of effect with local extent -- for example, the temperature measured at a weather station is understood to be representative of the weather for several surrounding miles.
-*
-*
-* data reduction, especially for a heatmap visualization;
-* extracting a continuous measurement from a pointwise sample;
-* providing a common basis for comparison of multiple datasets;
-* smoothing out spatial variation;
-* for all the other reasons you aggregate groups of related values in context
-* You have sampled data at points in order to estimate something with spatial extent. The weather dataset is an example:
-* Data that manifests at a single point
-  represents a process with
-  For example, the number of airline passengers in and out of the major airport
-  are travelling to and from local destinations
-* Smoothing pointwise data
-  into a
-  easier to compare or manage
-* continuous approximation
-  represents just the variation due to spatial
-  variables
+// * You want to "wash out" everything but the spatial variation -- even though the data was gathered for each
+// * Point measurement of effect with local extent -- for example, the temperature measured at a weather station is understood to be representative of the weather for several surrounding miles.
+// *
+// *
+// * data reduction, especially for a heatmap visualization;
+// * extracting a continuous measurement from a pointwise sample;
+// * providing a common basis for comparison of multiple datasets;
+// * smoothing out spatial variation;
+// * for all the other reasons you aggregate groups of related values in context
+// * You have sampled data at points in order to estimate something with spatial extent. The weather dataset is an example:
+// * Data that manifests at a single point
+//   represents a process with
+//   For example, the number of airline passengers in and out of the major airport
+//   are travelling to and from local destinations
+// * Smoothing pointwise data
+//   into a
+//   easier to compare or manage
+// * continuous approximation
+//   represents just the variation due to spatial
+//   variables
 
 The straightforward approach we'll take is to divide the world up into a grid of tiles and map the position of each point onto the unique grid tile it occupies. We can then group on each tile
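
As a minimal sketch of that tile-and-group idea (not the book's Pig script), the Python below assumes a plain equirectangular grid with a hypothetical `cell_deg` cell size in degrees. The names `tile_for` and `count_by_tile` are illustrative only.

```python
import math
from collections import Counter

def tile_for(lng, lat, cell_deg=1.0):
    """Return the (column, row) index of the grid cell containing a point.

    Assumes a simple equirectangular grid whose origin is at
    longitude -180, latitude -90; cell_deg is a hypothetical cell size.
    """
    col = math.floor((lng + 180.0) / cell_deg)
    row = math.floor((lat + 90.0) / cell_deg)
    return (col, row)

def count_by_tile(points, cell_deg=1.0):
    """Group (lng, lat) points by tile and count the points in each tile."""
    return Counter(tile_for(lng, lat, cell_deg) for lng, lat in points)

# Coordinates are listed longitude first, then latitude.
points = [(-97.7, 30.3), (-97.2, 30.9), (24.9, 60.2)]
counts = count_by_tile(points)
```

Any other aggregate (average, maximum, and so on) could be computed per tile in the same way once the points carry a tile key.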
 
@@ -86,3 +86,14 @@ An epidemiologist or transportation analyst interested in knowing the large-scal
 * _Exercises_ --
 * _Important to Know_ --
   - A https://en.wikipedia.org/wiki/Dot_distribution_map[Dot Distribution Map] is in some sense the counterpart to a spatial average -- turning data over a region into data at synthesized points
+
+
+=== Matching Points within a Given Distance (Pointwise Spatial Join)
+
+
+
+==== Distance isn't as it seems
+
+The picture below shows a circle, 350 km in radius, centered at 60 degrees latitude (up near Helsinki). What you should see is that the lines of constant longitude "come together" faster than the curve of the circle does. This is most 
+
+image::images/11-circle_of_constant_distance.png[Min/Max Longitudes are not at the same latitude as the center]
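
To put numbers on the figure, here is a rough sketch that treats the Earth as a sphere of radius 6371 km (an assumption, as is the function name `longitude_extent`). It uses the standard spherical right-triangle relations to find how far east or west the circle reaches, and at what latitude it does so. It is only valid while the circle does not enclose a pole.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth

def longitude_extent(center_lat_deg, radius_km):
    """Return (half_width_deg, tangent_lat_deg) for a circle on a sphere.

    half_width_deg  -- greatest longitude offset reached by the circle
    tangent_lat_deg -- latitude at which that extreme longitude occurs
    Valid only when the circle does not contain a pole.
    """
    lat = math.radians(center_lat_deg)
    r = radius_km / EARTH_RADIUS_KM  # angular radius of the circle
    half_width = math.asin(math.sin(r) / math.cos(lat))
    tangent_lat = math.asin(math.sin(lat) / math.cos(r))
    return math.degrees(half_width), math.degrees(tangent_lat)

half_width_deg, tangent_lat_deg = longitude_extent(60.0, 350.0)
```

For a center in the northern hemisphere the extreme longitudes fall slightly poleward of the center's latitude, which is the effect the image caption describes.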
diff --git a/11c-geospatial_mechanics.asciidoc b/11c-geospatial_mechanics.asciidoc
@@ -1,19 +1,20 @@
 === Mechanics of Spatial Data
 
-We kicked off the chapter with an interesting example that didn't require too many new concepts, but it's time to backtrack a bit and properly cover the mechanics of working with spatial data. 
+We kicked off the chapter with two examples that didn't require too many new concepts, but it's time to backtrack a bit and properly cover the mechanics of working with spatial data. 
 
 The data types and operations are extremely well standardized by the http://www.opengeospatial.org/standards/sfa[Open Geospatial Consortium]. Nearly all of the operations below have identical behavior within Oracle, PostGIS, SQL Server, and all industrial-strength geospatial systems. In fact, the geospatial toolkits for Pig (Pigeon) and Hive (Esri-SFFH) are particularly simpatico as they both use Esri's wonderful https://github.com/Esri/geometry-api-java[Esri Geometry API] under the hood.
 
 ==== Spatial Data Types
 
-* `Point` --
-* `LineString` --
-* `Polygon`  -- 
-* `MultiPoint` --
-* `MultiLineString` -- Although a `Polygon` also has multiple chains of coordinates, a `Polygon` is not a `MultiLineString`. Most importantly, a `Polygon` represent a 2-D shape, and the coordinates delimit the continuous dense set of points in its interior; a `MultiLineString` represents a 1-D shape, and the coordinates delimit the endpoints of its line segments. What's more, the line strings defining a polygon must be 'rings': closed paths that do not cross or touch; the elements of a `LineString` or `MultiLineString` are allowed to be either open or closed and may cross or touch.
-* `MultiPolygon` -- you guessed it, a collection of polygons. They're allowed to overlap, lie within each other, or anything else they want to do.
+* `Point` -- a single location in space, given by its horizontal, then vertical coordinates. That's an easy convention to swallow when you think in terms of `x`, `y` -- but it also means you should always list coordinates in the order longitude first, then latitude. Get in the habit of always using that ordering.
+* `LineString` -- a single continuous path, described as an ordered sequence of points. To describe a closed path, repeat the line's start point as its end point. A path is 'simple' if it does not cross or touch itself; a path is a 'ring' if it is both simple and closed.
+* `Polygon`  -- a connected surface in space, described by at least one closed simple path defining its exterior, and zero, one, or many non-crossing rings defining any interior holes. The exterior ring is always listed first, and no ring is permitted to cross or touch itself or any other ring.
+* `MultiPoint` -- a collection of points regarded as a single shape. 
+* `MultiLineString` -- a collection of lines regarded as a single shape. Although a `Polygon` also has multiple chains of coordinates, a `Polygon` is not a `MultiLineString`. Most importantly, a `Polygon` represents a 2-D shape with an interior; a `MultiLineString` represents a collection of 1-D shapes. What's more, the line strings defining a polygon must be non-intersecting rings, while the elements of a `LineString` or `MultiLineString` are permitted to be either open or closed, and may cross or touch.
+* `MultiPolygon` -- you guessed it, a collection of polygons regarded as a single shape. These polygons are allowed to overlap, lie within each other, or anything else they want to do. 
+* `Envelope` -- an axis-aligned rectangle depicting the minimum and maximum extent of a shape in each coordinate. Since its sides are aligned with the axes, we only have to give the coordinates of two of its corners. From the perspective of the geometry libraries this does not live in the same type hierarchy as the geometry objects above, but it's easy enough to generate the polygon corresponding to an envelope or the envelope of any shape. Any time you're specifying a bounding box, follow the convention of numerically-lowest-coordinates then numerically-highest-coordinates, i.e. `( (min_x, min_y), (max_x, max_y) )`. Like the longitude-then-latitude convention, it's violated just often enough to drive you crazy.
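
The envelope convention above can be sketched in a few lines of plain Python. This is an illustration only, not the Esri Geometry API; `envelope` and `envelope_polygon` are hypothetical helper names, and shapes are assumed to be given as lists of `(x, y)` pairs, that is, longitude then latitude.

```python
def envelope(coords):
    """Return ((min_x, min_y), (max_x, max_y)) for a list of (x, y) pairs."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return ((min(xs), min(ys)), (max(xs), max(ys)))

def envelope_polygon(env):
    """Return the closed exterior ring of the rectangle described by env.

    The start point is repeated as the end point, so the path is closed.
    """
    (min_x, min_y), (max_x, max_y) = env
    return [
        (min_x, min_y),
        (max_x, min_y),
        (max_x, max_y),
        (min_x, max_y),
        (min_x, min_y),
    ]

# A small open path, longitude first and latitude second.
path = [(-97.7, 30.3), (-95.4, 29.8), (-96.8, 32.8)]
env = envelope(path)
ring = envelope_polygon(env)
```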
 
-Those are the essential data types used by the underlying geospatial methods. However, when adapting geospatial methods to Pig there are really only two families of shapes to consider:
+Those are the essential data types used by geospatial libraries everywhere. However, when adapting geospatial methods to Pig there are really only two families of shapes to consider:
 
 * Points, which lack spatial extent
 * Regions (i.e. all geometries that are not of type `Point`), which span more than one location in space
@@ -78,7 +79,7 @@ Somewhat related are operations that bring shapes in and out of Pig's control.
 // * Curve, Surface, MultiCurve, MultiSurface, GeomCollection, Geometry
 // * M, Z / MaxZ / MaxM / MinM / MinZ
 
-==== Simple Properties of Shapes
+==== Properties of Shapes
 
 * `GeoArea(geom)`
 * `MinX(geom)`, `MinY(geom)`, `MaxX(geom)`, `MaxY(geom)` -- the numerically greatest and least extent of a shape in the specified dimension.
@@ -100,7 +101,7 @@ There are also a set of meta-operations that report on the geometry objects repr
 * `GeometryType(geom)` -- string representing the geometry type: `'Point'`, `'LineString'`, ..., `'MultiPolygon'`.
 * `IsGeomEmpty(geom)` -- 1 if the geometry contains no actual points.
 * `IsLineClosed(line)` -- 1 if the given `LineString`'s end point meets its start point.
-* `IsSimple` -- 1 if the geometry has no anomalous geometric aspects, such intersecting or being tangent to itself. (TODO: tasteful joke goes here.)
+* `IsSimple` -- 1 if the geometry has no anomalous geometric aspects, such as intersecting or being tangent to itself. A multipoint is 'simple' if none of its points coincide. 
 * `IsLineRing` -- 1 if the given `LineString` is a ring -- that is, closed and simple.
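
Two of these checks are simple enough to sketch directly. The Python below is an illustration of the definitions only, not the UDFs themselves: it assumes coordinates are `(x, y)` tuples compared exactly, and the names `is_line_closed` and `is_multipoint_simple` are hypothetical. The general `IsSimple` test for lines also needs segment-intersection checks, which are omitted here.

```python
def is_line_closed(coords):
    """True if a line string's end point meets its start point."""
    return len(coords) >= 2 and coords[0] == coords[-1]

def is_multipoint_simple(points):
    """True if no two points of the multipoint coincide."""
    return len(set(points)) == len(points)

open_path = [(0, 0), (1, 0), (1, 1)]
closed_path = [(0, 0), (1, 0), (1, 1), (0, 0)]
```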
 
 * `NumGeometries(geom_collection)`
diff --git a/11c-spatial_aggregations_on_regions.asciidoc b/11c-spatial_aggregations_on_regions.asciidoc
@@ -4,8 +4,7 @@ While spatially aggregating pointwise data required nothing terribly sophisticat
 
 Fair warning: though the first version of this script we demonstrate for you is correct as it stands, it will be exceptionally inefficient. But we want to demonstrate the problem and then discuss its elegant solution.
 
-   X points    Y gridcells occupied    ~ 6000 grid cells (50 x 120)
-
+// X points    Y gridcells occupied    ~ 6000 grid cells (50 x 120)
 
 The key first step is to assemble a table giving the weight of each region's contribution to the tile.
 
@@ -49,53 +48,61 @@ Joining that to the 2.3 million rows of crop production data produces big data f
 
 // If what follows doesn't set your heart singing, you might have chosen the wrong profession.
 
-Our solution will follow the same logic as the solution JT learned from the friendly pianist.
-Instead of breaking down
+// Our solution will follow the same logic as the solution JT learned from the friendly pianist.
 
-image::images/images/11-f-quad_decompositions/spatial_manor-rooms_peeps_grid.png[Spatial Manor - Rooms (and underlying ZL-3 grid)]
+Instead of breaking down
 
-image::images/images/11-f-quad_decompositions/spatial_manor-quadkeys.png[Spatial Manor - Decomposed]
+image::images/11-f-quad_decompositions/11-quaddecomp-world-mercator.png[Variable Level of Detail prevents Inefficient Tiny Tiles]
 
 The QuadDecompose UDF accepts an outer and an inner zoom level.
 Every region will be broken down to at least the coarser zoom level; you'll use this grid size as the partition key (in map/reduce) or group key (in Pig).
 No region will be broken into tiles smaller than the inner zoom level, which is important for both managing the data volume and for ensuring we don't group data into finer bins than it can support (see the section on "Choosing a Histogram Bin Size" (REF)).
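
The two-level behaviour described above can be sketched as follows. This is not the QuadDecompose UDF: it assumes the region is an axis-aligned rectangle `(min_x, min_y, max_x, max_y)` on a unit square, tiles are `(zoom, tx, ty)` triples, and the names `quad_decompose`, `coarse_zl` and `fine_zl` are hypothetical.

```python
def tile_bounds(zl, tx, ty):
    """Bounds (min_x, min_y, max_x, max_y) of a tile on the unit square."""
    size = 1.0 / (1 << zl)
    return (tx * size, ty * size, (tx + 1) * size, (ty + 1) * size)

def intersects(a, b):
    """True if the interiors of two rectangles overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def contains(outer, inner):
    """True if rectangle `inner` lies wholly within rectangle `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def quad_decompose(region, coarse_zl, fine_zl):
    """Cover `region` with tiles no coarser than coarse_zl, no finer than fine_zl."""
    tiles = []

    def visit(zl, tx, ty):
        bounds = tile_bounds(zl, tx, ty)
        if not intersects(region, bounds):
            return
        # Stop splitting once the tile is wholly interior or at the finest level.
        if zl == fine_zl or contains(region, bounds):
            tiles.append((zl, tx, ty))
            return
        for dy in (0, 1):
            for dx in (0, 1):
                visit(zl + 1, 2 * tx + dx, 2 * ty + dy)

    n = 1 << coarse_zl
    for ty in range(n):
        for tx in range(n):
            visit(coarse_zl, tx, ty)
    return tiles

tiles = quad_decompose((0.0, 0.0, 0.625, 0.5), coarse_zl=1, fine_zl=3)
```

In this toy case the interior is covered by a single zoom-level 1 tile and only the ragged edge is split down to zoom-level 3, where a naive decomposition at the fine level alone would emit a tile for every fine cell the region touches.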
 
 In the graphic (REF), you can see the result:
 the top two images, showing the tiles decomposed naively (on the left) and hierarchically (on the right), are identical. The lower row has the same data but with a light border around each tile in the heatmap. The large interior portions of Russia, China, the USA, Brazil and others are now represented by coarser zoom-level blocks. Even France and Argentina manage to line up their borders fortuitously for a zoom-level 6 (TODO check size) block. In all, we reduced the number of tiles from XXX down to XXX without impacting accuracy.
-
+// 
 TODO: move to join part below.
 
 Since every point on the map is covered by at most one region for the current data set, you'll find that every tile is either (a) wholly in the interior of a country, or (b) at the finest zoom level: there's no way for Russia to send data off to a huge zoom-level 5 tile while Latvia sends data to one of its zoom-level 7 children.
 The zoom-level chosen for each tile in the `FOREACH...QuadDecompose` pass was the correct granularity for the result.
+
 // TODO: reword
 
+=== Quadtree Decomposition and Numbering
 
+Let's look more closely at the quadtree scheme.
 
-=== Projections and Tiling Schemes
+Our Reindeer friends, with their ample free time and bizarre sense of humor, sometimes vacation at a quaint estate known as Spatial Manor. On the one hand, every once in a while somebody attacks a fellow guest with a candlestick or rope or whatever's close at hand. On the other hand, most people _aren't_ killed and get to enjoy the rousing sport of solving a mystery!
 
+We can apply our spatial analysis tools to its simple geometry and help the investigators make a data-backed decision. Here is a map of the grounds.
 
-* Equal-area:
-  - features uniformly distributed on the globe will be uniformly distributed among grid cells.
-* Platte-Careé (Equirectangular)
-  - Extremely simple to compute
-  - Plot directly into screen coordinates with
-  -
+image::images/images/11-f-quad_decompositions/spatial_manor-rooms_peeps_grid.png[Spatial Manor - Rooms (and underlying ZL-3 grid)]
+
+image::images/images/11-f-quad_decompositions/spatial_manor-quadkeys.png[Spatial Manor - Decomposed]
+
+
+
+// === Projections and Tiling Schemes
+// 
+// 
+// * Equal-area:
+//   - features uniformly distributed on the globe will be uniformly distributed among grid cells.
+// * Plate Carrée (Equirectangular)
+//   - Extremely simple to compute
+//   - Plot directly into screen coordinates with
+//   -
 
 ==== Exporting data for Presentation by a Tileserver
 
-The most commonly
+// The most commonly
 
 Features following a constant bearing in any direction -- Manhattan's Broadway, or the borderlines of Algeria or Nevada -- remain straight lines on the map. This is important for navigational purposes
 
 The locality properties of quadtiles indexing
 
-
-
 Typically you will store the shape clipped to the given quad-tile.
  metadata about that region -- population, metric tons of bananas exported annually, an image of its flag, lyrics to its national anthem -- is stored under the same key but in independent columns footnote:[typically the regions are heavyweight, heavily requested and read-only, so they deserve their own table or at least their own column family.]
 
-
-
 You don't have to break
 What we do is
 When tile 0231_1 is requested
diff --git a/images/11-circle_of_constant_distance.png b/images/11-circle_of_constant_distance.png
diff --git a/images/11-f-quad_decompositions/11-quaddecomp-world-mercator.png b/images/11-f-quad_decompositions/11-quaddecomp-world-mercator.png
diff --git a/images/Quadtiles-ClueRegions.graffle b/images/Quadtiles-ClueRegions.graffle
@@ -18075,7 +18075,7 @@
 			<key>Print</key>
 			<string>YES</string>
 			<key>View</key>
-			<string>NO</string>
+			<string>YES</string>
 		</dict>
 		<dict>
 			<key>Lock</key>
@@ -18160,7 +18160,7 @@
 	<key>MasterSheets</key>
 	<array/>
 	<key>ModificationDate</key>
-	<string>2014-07-30 06:39:19 +0000</string>
+	<string>2014-07-31 18:49:24 +0000</string>
 	<key>Modifier</key>
 	<string>Philip flip Kromer</string>
 	<key>NotesVisible</key>
@@ -18258,7 +18258,7 @@
 		<key>SidebarWidth</key>
 		<integer>120</integer>
 		<key>VisibleRegion</key>
-		<string>{{-363, 2}, {1331, 1049}}</string>
+		<string>{{-363, 1}, {1331, 1049}}</string>
 		<key>Zoom</key>
 		<real>1</real>
 		<key>ZoomValues</key>