11b-spatial_aggregation-points.asciidoc
+31 -20 Lines changed: 31 additions & 20 deletions
Original file line number
Diff line number
Diff line change
@@ -6,26 +6,26 @@ We will start, as we always do, by applying patterns that turn Big Data into Muc
6
6
A great tool for visualizing a large spatial data set
7
7
8
8
9
-
* You want to "wash out" everything but the spatial variation -- even though the data was gathered for each
10
-
* Point measurement of effect with local extent -- for example, the temperature measured at a weather station is understood to be representative of the weather for several surrounding miles.
11
-
*
12
-
*
13
-
* data reduction, especially for a heatmap visualization;
14
-
* extracting a continuous measurement from a pointwise sample;
15
-
* providing a common basis for comparison of multiple datasets;
16
-
* smoothing out spatial variation;
17
-
* for all the other reasons you aggregate groups of related values in context
18
-
* You have sampled data at points in order to estimate something with spatial extent. The weather dataset is an example:
19
-
* Data that manifests at a single point
20
-
represents a process with
21
-
For example, the number of airline passengers in and out of the major airport
22
-
are travelling to and from local destinations
23
-
* Smoothing pointwise data
24
-
into a
25
-
easier to compare or manage
26
-
* continuous approximation
27
-
represents just the variation due to spatial
28
-
variables
9
+
// * You want to "wash out" everything but the spatial variation -- even though the data was gathered for each
10
+
// * Point measurement of effect with local extent -- for example, the temperature measured at a weather station is understood to be representative of the weather for several surrounding miles.
11
+
// *
12
+
// *
13
+
// * data reduction, especially for a heatmap visualization;
14
+
// * extracting a continuous measurement from a pointwise sample;
15
+
// * providing a common basis for comparison of multiple datasets;
16
+
// * smoothing out spatial variation;
17
+
// * for all the other reasons you aggregate groups of related values in context
18
+
// * You have sampled data at points in order to estimate something with spatial extent. The weather dataset is an example:
19
+
// * Data that manifests at a single point
20
+
// represents a process with
21
+
// For example, the number of airline passengers in and out of the major airport
22
+
// are travelling to and from local destinations
23
+
// * Smoothing pointwise data
24
+
// into a
25
+
// easier to compare or manage
26
+
// * continuous approximation
27
+
// represents just the variation due to spatial
28
+
// variables
29
29
30
30
The straightforward approach we'll take is to divide the world up into a grid of tiles and map the position of each point onto the unique grid tile it occupies. We can then group on each tile
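The tile-and-group step can be sketched in a few lines. This is only an illustration in Python rather than the chapter's Pig, and it assumes a plain longitude/latitude grid with a hypothetical `cell_deg` tile size; the function names are invented for the example and are not part of any library used in the book.

```python
import math
from collections import Counter

def tile_for(lon, lat, cell_deg=1.0):
    """Return the (column, row) index of the grid tile containing a point.

    Assumes a simple grid of cell_deg x cell_deg degree tiles.
    """
    return (math.floor(lon / cell_deg), math.floor(lat / cell_deg))

def count_by_tile(points, cell_deg=1.0):
    """Group (lon, lat) points by their tile and count the points in each tile."""
    return Counter(tile_for(lon, lat, cell_deg) for lon, lat in points)

# Two points near Austin fall in the same one-degree tile; Paris gets its own.
points = [(-97.7, 30.3), (-97.2, 30.9), (2.35, 48.85)]
tile_counts = count_by_tile(points)
```

Any aggregate (mean, max, and so on) can replace the count once every point carries its tile key.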
31
31
@@ -86,3 +86,14 @@ An epidemiologist or transportation analyst interested in knowing the large-scal
86
86
* _Exercises_ --
87
87
* _Important to Know_ --
88
88
- A https://en.wikipedia.org/wiki/Dot_distribution_map[Dot Distribution Map] is in some sense the counterpart to a spatial average -- turning data over a region into data at synthesized points
89
+
90
+
91
+
=== Matching Points within a Given Distance (Pointwise Spatial Join)
92
+
93
+
94
+
95
+
==== Distance isn't as it seems
96
+
97
+
The picture below shows a circle, 350 km in radius, centered at 60 degrees latitude (up near Helsinki). What you should see is that the lines of constant longitude "come together" faster than the curve of the circle does. This is most
98
+
99
+
image::images/11-circle_of_constant_distance.png[Min/Max Longitudes are not at the same latitude as the center]
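To put numbers on that picture, here is a rough sketch that assumes a perfectly spherical earth of radius 6371 km and a circle that does not reach a pole. The function name and its return layout are invented for this illustration. It compares the longitude half-width you get by naively scaling the radius by the cosine of the center latitude against the circle's true extreme longitudes, and reports the latitude at which those extremes occur.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical-earth assumption

def circle_lon_extent(center_lat_deg, radius_km):
    """Return (naive_dlon, true_dlon, tangent_lat), all in degrees.

    naive_dlon:  radius converted to longitude along the center's parallel
    true_dlon:   actual half-width in longitude of the circle
    tangent_lat: latitude where the circle touches its min/max meridians
    Only valid when the circle does not contain a pole.
    """
    ang = radius_km / EARTH_RADIUS_KM          # angular radius in radians
    lat = math.radians(center_lat_deg)
    naive_dlon = ang / math.cos(lat)
    true_dlon = math.asin(math.sin(ang) / math.cos(lat))
    tangent_lat = math.asin(math.sin(lat) / math.cos(ang))
    return tuple(math.degrees(v) for v in (naive_dlon, true_dlon, tangent_lat))

naive_dlon, true_dlon, tangent_lat = circle_lon_extent(60.0, 350.0)
```

For the circle in the figure, the true half-width comes out slightly wider than the naive one, and the extreme longitudes sit a little poleward of 60 degrees, which is the effect the image's caption describes.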
11c-geospatial_mechanics.asciidoc
+11 -10 Lines changed: 11 additions & 10 deletions
@@ -1,19 +1,20 @@
1
1
=== Mechanics of Spatial Data
2
2
3
-
We kicked off the chapter with an interesting example that didn't require too many new concepts, but it's time to backtrack a bit and properly cover the mechanics of working with spatial data.
3
+
We kicked off the chapter with two examples that didn't require too many new concepts, but it's time to backtrack a bit and properly cover the mechanics of working with spatial data.
4
4
5
5
The data types and operations are extremely well standardized by the http://www.opengeospatial.org/standards/sfa[Open Geospatial Consortium]. Nearly all of the operations below have identical behavior within Oracle, PostGIS, SQL Server, and all industrial-strength geospatial systems. In fact, the geospatial toolkits for Pig (Pigeon) and Hive (Esri-SFFH) are particularly simpatico, as they both use Esri's wonderful https://github.com/Esri/geometry-api-java[Esri Geometry API] under the hood.
6
6
7
7
==== Spatial Data Types
8
8
9
-
* `Point` --
10
-
* `LineString` --
11
-
* `Polygon` --
12
-
* `MultiPoint` --
13
-
* `MultiLineString` -- Although a `Polygon` also has multiple chains of coordinates, a `Polygon` is not a `MultiLineString`. Most importantly, a `Polygon` represent a 2-D shape, and the coordinates delimit the continuous dense set of points in its interior; a `MultiLineString` represents a 1-D shape, and the coordinates delimit the endpoints of its line segments. What's more, the line strings defining a polygon must be 'rings': closed paths that do not cross or touch; the elements of a `LineString` or `MultiLineString` are allowed to be either open or closed and may cross or touch.
14
-
* `MultiPolygon` -- you guessed it, a collection of polygons. They're allowed to overlap, lie within each other, or anything else they want to do.
9
+
* `Point` -- a single location in space, given by its horizontal, then vertical coordinates. That's an easy convention to swallow when you think in terms of `x`, `y` -- but it also means you should always list coordinates in the order longitude first, then latitude. Get in the habit of always using that ordering.
10
+
* `LineString` -- a single continuous path, described as an ordered sequence of points. To describe a closed path, repeat the line's start point as its end point. A path is 'simple' if it does not cross or touch itself; a path is a 'ring' if it is both simple and closed.
11
+
* `Polygon` -- a connected surface in space, described by at least one closed simple path defining its exterior, and zero, one, or many non-crossing rings defining any interior holes. The exterior ring is always listed first, and no ring is permitted to cross or touch itself or any other ring.
12
+
* `MultiPoint` -- a collection of points regarded as a single shape.
13
+
* `MultiLineString` -- a collection of lines regarded as a single shape. Although a `Polygon` also has multiple chains of coordinates, a `Polygon` is not a `MultiLineString`. Most importantly, a `Polygon` represents a 2-D shape with an interior; a `MultiLineString` represents a collection of 1-D shapes. What's more, the line strings defining a polygon must be non-intersecting rings, while the elements of a `LineString` or `MultiLineString` are permitted to be either open or closed, and may cross or touch.
14
+
* `MultiPolygon` -- you guessed it, a collection of polygons regarded as a single shape. These polygons are allowed to overlap, lie within each other, or anything else they want to do.
15
+
* `Envelope` -- an axis-aligned rectangle depicting the minimum and maximum extent of a shape in each coordinate. Since its sides are aligned with the axes, we only have to give the coordinates of two of its corners. From the perspective of the geometry libraries this does not live in the same type hierarchy as the geometry objects above, but it's easy enough to generate the polygon corresponding to an envelope or the envelope of any shape. Any time you're specifying a bounding box, follow the convention of numerically-lowest-coordinates then numerically-highest-coordinates, i.e. `( (min_x, min_y), (max_x, max_y) )`. Like the longitude-then-latitude convention, it's violated just often enough to drive you crazy.
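As a small illustration of the `Envelope` convention, the sketch below computes a bounding box from a list of `(x, y)` coordinate pairs (longitude first, then latitude) and turns it back into a closed ring. It is plain Python with invented helper names, not the Esri Geometry API or Pigeon; it only assumes that a shape can be handed over as a list of coordinate tuples.

```python
def envelope(coords):
    """Return ((min_x, min_y), (max_x, max_y)) for a list of (x, y) pairs."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return ((min(xs), min(ys)), (max(xs), max(ys)))

def envelope_ring(env):
    """Return the closed ring (first point repeated last) tracing an envelope."""
    (min_x, min_y), (max_x, max_y) = env
    return [
        (min_x, min_y),
        (max_x, min_y),
        (max_x, max_y),
        (min_x, max_y),
        (min_x, min_y),  # repeat the start point to close the path
    ]

# A path through three Texas cities, as (longitude, latitude) pairs.
path = [(-97.7, 30.3), (-95.4, 29.8), (-96.8, 32.8)]
bbox = envelope(path)
ring = envelope_ring(bbox)
```

Returning the lowest corner first keeps the result in the `( (min_x, min_y), (max_x, max_y) )` order recommended above.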
15
16
16
-
Those are the essential data types used by the underlying geospatial methods. However, when adapting geospatial methods to Pig there are really only two families of shapes to consider:
17
+
Those are the essential data types used by geospatial libraries everywhere. However, when adapting geospatial methods to Pig there are really only two families of shapes to consider:
17
18
18
19
* Points, which lack spatial extent
19
20
* Regions (i.e. all geometries that are not of type `Point`), which span more than one location in space
@@ -78,7 +79,7 @@ Somewhat related are operations that bring shapes in and out of Pig's control.
* `IsGeomEmpty(geom)` -- 1 if the geometry contains no actual points.
102
103
* `IsLineClosed(line)` -- 1 if the given `LineString`'s end point meets its start point.
103
-
* `IsSimple` -- 1 if the geometry has no anomalous geometric aspects, such intersecting or being tangent to itself. (TODO: tasteful joke goes here.)
104
+
* `IsSimple` -- 1 if the geometry has no anomalous geometric aspects, such as intersecting or being tangent to itself. A multipoint is 'simple' if none of its points coincide.
104
105
* `IsLineRing` -- 1 if the given `LineString` is a ring -- that is, closed and simple.
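Two of these predicates are simple enough to sketch directly. The Python functions below are rough analogues written for illustration, assuming geometries are given as lists of `(x, y)` tuples; they are not the Pigeon or Esri implementations, and they return booleans rather than 1 or 0. A full `IsSimple` test for line strings needs segment-intersection checks and is left out.

```python
def is_line_closed(coords):
    """True if a line string's end point meets its start point."""
    return len(coords) > 1 and coords[0] == coords[-1]

def is_multipoint_simple(points):
    """True if no two points in the collection coincide."""
    return len(set(points)) == len(points)

open_path = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
closed_path = open_path + [(0.0, 0.0)]
```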
11c-spatial_aggregations_on_regions.asciidoc
+26 -19 Lines changed: 26 additions & 19 deletions
@@ -4,8 +4,7 @@ While spatially aggregating pointwise data required nothing terribly sophisticat
4
4
5
5
Fair warning: though the first version of this script we demonstrate for you is correct as it stands, it will be exceptionally inefficient. But we want to demonstrate the problem and then discuss its elegant solution.
6
6
7
-
X points Y gridcells occupied ~ 6000 grid cells (50 x 120)
8
-
7
+
// X points Y gridcells occupied ~ 6000 grid cells (50 x 120)
9
8
10
9
The key first step is to assemble a table giving the weight of each region's contribution to the tile.
11
10
@@ -49,53 +48,61 @@ Joining that to the 2.3 million rows of crop production data produces big data f
49
48
50
49
// If what follows doesn't set your heart singing, you might have chosen the wrong profession.
51
50
52
-
Our solution will follow the same logic as the solution JT learned from the friendly pianist.
53
-
Instead of breaking down
51
+
// Our solution will follow the same logic as the solution JT learned from the friendly pianist.
image::images/11-f-quad_decompositions/11-quaddecomp-world-mercator.png[Variable Level of Detail prevents Inefficient Tiny Tiles]
58
56
59
57
The QuadDecompose UDF accepts an outer and an inner zoom level.
60
58
Every region will be broken down to at least the coarser zoom level; you'll use this grid size as the partition key (in map/reduce) or group key (in Pig).
61
59
No region will be broken into tiles smaller than the inner zoom level, which is important for both managing the data volume and for ensuring we don't group data into finer bins than it can support (see the section on "Choosing a Histogram Bin Size" (REF)).
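The two zoom-level rules can be sketched as a small recursive subdivision. This is not the QuadDecompose UDF itself: it is an illustrative Python version that assumes the world is the unit square, that a tile is a `(zoom, tx, ty)` triple, and that the region is a plain axis-aligned rectangle `(min_x, min_y, max_x, max_y)` so the containment test stays trivial. All function names are invented for the example.

```python
def tile_bounds(zoom, tx, ty):
    """Return (min_x, min_y, max_x, max_y) of a tile in the unit square."""
    size = 1.0 / (2 ** zoom)
    return (tx * size, ty * size, (tx + 1) * size, (ty + 1) * size)

def classify(tile, region):
    """Return 'outside', 'inside', or 'partial' for a tile against a rectangle."""
    t_min_x, t_min_y, t_max_x, t_max_y = tile_bounds(*tile)
    r_min_x, r_min_y, r_max_x, r_max_y = region
    if (t_max_x <= r_min_x or t_min_x >= r_max_x or
            t_max_y <= r_min_y or t_min_y >= r_max_y):
        return "outside"   # no overlap of positive area
    if (t_min_x >= r_min_x and t_max_x <= r_max_x and
            t_min_y >= r_min_y and t_max_y <= r_max_y):
        return "inside"    # tile lies wholly within the region
    return "partial"

def quad_decompose(region, outer_zoom, inner_zoom):
    """Cover a region with tiles no coarser than outer_zoom, no finer than inner_zoom."""
    result = []
    stack = [(0, 0, 0)]
    while stack:
        tile = stack.pop()
        zoom, tx, ty = tile
        relation = classify(tile, region)
        if relation == "outside":
            continue
        # Keep the tile once it is at least as fine as the outer level and is
        # either wholly interior or already at the finest allowed level.
        if zoom >= outer_zoom and (relation == "inside" or zoom >= inner_zoom):
            result.append(tile)
            continue
        # Otherwise split into the four child tiles one zoom level down.
        stack.extend((zoom + 1, 2 * tx + dx, 2 * ty + dy)
                     for dx in (0, 1) for dy in (0, 1))
    return sorted(result)

tiles = quad_decompose((0.0, 0.0, 0.75, 0.5), outer_zoom=1, inner_zoom=2)
```

In this toy case the large interior block stays a single zoom-level 1 tile while the remaining strip is covered by zoom-level 2 tiles, which is the same variable level of detail the figure illustrates.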
62
60
63
61
In the graphic (REF), you can see the result:
64
62
the top two images, showing the tiles decomposed naively (on the left) and hierarchically (on the right), are identical. The lower row has the same data but with a light border around each tile in the heatmap. The large interior portions of Russia, China, the USA, Brazil and others are now represented by coarser zoom-level blocks. Even France and Argentina manage to line up their borders fortuitously for a zoom-level 6 (TODO check size) block. In all, we reduced the number of tiles from XXX down to XXX without impacting accuracy.
65
-
63
+
//
66
64
TODO: move to join part below.
67
65
68
66
Since every point on the map is covered by at most one region for the current data set, you'll find that every tile is either (a) wholly in the interior of a country, or (b) at the finest zoom level: there's no way for Russia to send data off to a huge zoom-level 5 tile while Latvia sends data to one of its zoom-level 7 children.
69
67
The zoom-level chosen for each tile in the `FOREACH...QuadDecompose` pass was the correct granularity for the result.
68
+
70
69
// TODO: reword
71
70
71
+
=== Quadtree Decomposition and Numbering
72
72
73
+
Let's look closer at the quadtree scheme.
73
74
74
-
=== Projections and Tiling Schemes
75
+
Our Reindeer friends, with their ample free time and bizarre sense of humor, sometimes vacation at a quaint estate known as Spatial Manor. On the one hand, every once in a while somebody attacks a fellow guest with a candlestick or rope or whatever's close at hand. On the other hand, though, most people _aren't_ killed and get to enjoy the rousing sport of solving a mystery!
75
76
77
+
We can apply our spatial analysis tools to its simple geometry and help the investigators make a data-backed decision. Here is a map of the grounds.
76
78
77
-
* Equal-area:
78
-
- features uniformly distributed on the globe will be uniformly distributed among grid cells.
// - features uniformly distributed on the globe will be uniformly distributed among grid cells.
90
+
// * Plate Carrée (Equirectangular)
91
+
// - Extremely simple to compute
92
+
// - Plot directly into screen coordinates with
93
+
// -
83
94
84
95
==== Exporting data for Presentation by a Tileserver
85
96
86
-
The most commonly
97
+
// The most commonly
87
98
88
99
Features following a constant bearing in any direction -- Manhattan's Broadway, or the borderlines of Algeria or Nevada -- remain straight lines on the map. This is important for navigational purposes
89
100
90
101
The locality properties of quadtiles indexing
91
102
92
-
93
-
94
103
Typically you will store the shape clipped to the given quad-tile.
95
104
Metadata about that region -- population, metric tons of bananas exported annually, an image of its flag, lyrics to its national anthem -- is stored under the same key but in independent columns footnote:[Typically the regions are heavyweight, heavily requested and read-only, so they deserve their own table or at least their own column family.]