11a-geodata-intro.asciidoc: 36 additions & 0 deletions
@@ -1,6 +1,10 @@
1
1
Having conquered time (or, at least, learned the basics of timeseries analysis), let's move on to space.
2
2
Spatial analysis works with _data attached to points, paths and regions_, where the _fundamentally interesting relationships are among nearby objects_. This is some of our favorite material in the book for two reasons: because it's really useful, and because it will extend your understanding of map-reduce in a significant way.
3
3
4
+
// operations on data attached to shapes in the context of what is spatially nearby.
5
+
6
+
===== Spatial Analytics is a Useful Tool
7
+
4
8
Problems with a directly geographic aspect appear naturally in all sorts of human endeavors: "Where should I put this cell tower / oil well / coffee shop?" Analysis across billions of GPS paths on millions of routes not only allows a driving directions app to direct you to the correct lane when changing roads, it will power the self-driving cars of the (near?) future. http://www.slideshare.net/Hadoop_Summit/grailer-hochmuth-june27515pmroom212v3[Farmers are improving crop yields and reducing environmental impact] by combining data from suppliers, other growers, satellite feeds, and government agencies to fine-tune what strains they plant and what pests and diseases they act to prevent. Wind farm companies use petabytes of weather data to http://www.ibmbigdatahub.com/blog/lords-data-storm-vestas-and-ibm-win-big-data-award[optimize turbine locations and predict their yield].
5
9
Spatial analysis is an increasingly essential piece of any high-stakes effort in marketing, agriculture, security, politics, and any other field where location is a key variable.
6
10
@@ -10,6 +14,31 @@ To keep things concrete and relevant to the typical reader we're going to focus
10
14
11
15
// Taking a step back, the fundamental idea this chapter introduces is a direct way to extend locality to two dimensions. It so happens we did so in the context of geospatial data, and required a brief prelude about how to map our nonlinear feature space to the plane. Browse any of the open data catalogs (REF) or data visualization blogs, and you'll see that geographic datasets and visualizations are by far the most frequent. Partly this is because there are these two big obvious feature components, highly explanatory and direct to understand. But you can apply these tools any time you have a small number of dominant features and a sensible distance measure mapping them to a flat space.
12
16
17
+
===== Spatial Analytics is Good for Your Brain
18
+
19
+
The essential element of spatial analytics is to operate on shapes within the context of what is spatially nearby. We can do so by coaching Hadoop to:
20
+
// even when the chain of objects that are nearby is larger than
21
+
22
+
1. partition space into coarse-grained tiles
23
+
2. assign each tile to a reducer (equivalently, group each tile's objects into a bag)
24
+
3. ensure that everything potentially relevant for the objects on a tile finds its way there
25
+
4. eliminate any potentially-but-not-actually relevant objects and any duplicated results
26
+
27
+
Since a point has no extent, nominating its context is straightforward: send it to the tile it lives on. Spatially aggregating points on a tile requires exactly and only the occupants of that tile. As you'll see in our first example, that means it's no harder than the grouping operations you mastered back in Chapter 6 (REF).
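
As a rough illustration of why points are the easy case, here is a plain-Python sketch (Python rather than Pig so it runs without a cluster). The function name `tile_of`, the half-degree tile size, and the sample coordinates are all made up for this example. Each point lands on exactly one tile, so a per-tile count is an ordinary group-by.

```python
import math
from collections import Counter

def tile_of(lng, lat, tile_deg=0.5):
    """Return the (x, y) index of the single tile containing a point.

    tile_deg is the tile edge length in degrees (a hypothetical default).
    """
    return (math.floor(lng / tile_deg), math.floor(lat / tile_deg))

# Made-up sample points as (lng, lat); not real sighting data.
points = [(-97.74, 30.27), (-97.70, 30.30), (-97.20, 30.27), (-122.42, 37.77)]

# Every point has exactly one tile, so nothing is duplicated and nothing
# outside the tile is needed to compute the tile's count.
counts = Counter(tile_of(lng, lat) for lng, lat in points)
```

The first two points share a tile, so that tile's count is 2 and the other two tiles each hold one point.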
28
+
29
+
In contrast, a multi-point, a line, or any other shape having spatial extent might cross one or many tiles, or even wholly contain them. It might have gaps and holes, might cross from 180 degrees longitude to -180 degrees longitude, or (by covering one of the poles) might even span all 360 degrees of longitude. And while it's cheap to determine whether pairs of points or rectangles intersect, touch, lie within a given distance, or satisfy whatever other definition of "spatially relevant" is in play, you need to perform the corresponding operations on the complex shapes eventually, which can be quite expensive. That is why step 4 appears in the list above, and why we think this chapter is so important for a deep understanding of orchestrating big data operations.
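
One way to picture how a shape with extent gets sent to every tile it might touch is to scatter its bounding box. The following is a minimal Python sketch under stated assumptions: `tiles_for_bbox` is a hypothetical helper, tiles are `tile_deg` degrees on a side, `west > east` is read as "crosses the antimeridian", and boxes touching the poles are not handled.

```python
import math

def tiles_for_bbox(west, south, east, north, tile_deg=10.0):
    """List the (x, y) indices of every tile a bounding box overlaps.

    Longitudes are in [-180, 180); west > east is taken to mean the box
    crosses the antimeridian. tile_deg is assumed to divide 360 evenly.
    """
    n_cols = int(round(360.0 / tile_deg))
    x_west = math.floor((west + 180.0) / tile_deg)
    x_east = math.floor((east + 180.0) / tile_deg)
    if west > east:
        # Wrap around: walk east from x_west past the last column to x_east.
        x_east += n_cols
    y_south = math.floor((south + 90.0) / tile_deg)
    y_north = math.floor((north + 90.0) / tile_deg)
    return [(x % n_cols, y)
            for x in range(x_west, x_east + 1)
            for y in range(y_south, y_north + 1)]

small = tiles_for_bbox(2.0, 48.0, 3.0, 49.0)            # fits inside one tile
wide = tiles_for_bbox(-5.0, 41.0, 10.0, 51.0)           # spans a 3 x 2 block
wrapped = tiles_for_bbox(175.0, -20.0, -175.0, -15.0)   # crosses the antimeridian
```

The wrapped box ends up on the last and the first tile columns, which is the bookkeeping a naive `range(x_west, x_east + 1)` would get wrong.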
30
+
31
+
What we need to do is scatter each object to all the groups where it _might_ be relevant, knowing that (a) it might be grouped with objects that are not actually relevant, and (b) relevant operations may be duplicated within multiple groups.
32
+
A person standing in Lesotho is within South Africa's bounding box (and so potentially relevant) but is not actually within South Africa (and so is not relevant for the "Within" relationship). Territories of the UK and France lie within 60 km of each other not only from Dover to Calais in Europe but also from Montserrat to Guadeloupe in the Caribbean.
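
The gap between "potentially relevant" and "actually relevant" can be sketched with toy geometry. The squares below are stand-ins, not real borders: an outer square plays the enclosing country and a smaller square cut out of it plays the enclave. The helper names (`in_ring`, `in_bbox`, `within`) are hypothetical, and the exact test is a textbook ray-casting check, not necessarily what a production geometry library does.

```python
def in_ring(x, y, ring):
    """Ray-casting test: is (x, y) inside the closed ring of (x, y) vertices?"""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's y can be crossed by a ray
        # cast from the point toward +x.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def in_bbox(x, y, ring):
    """Cheap test: is (x, y) inside the ring's bounding rectangle?"""
    xs = [p[0] for p in ring]
    ys = [p[1] for p in ring]
    return min(xs) <= x <= max(xs) and min(ys) <= y <= max(ys)

def within(x, y, outer, holes):
    """Exact test: inside the outer ring and not inside any hole."""
    return in_ring(x, y, outer) and not any(in_ring(x, y, h) for h in holes)

# Toy geometry: a square "country" with a square enclave cut out of it.
outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
enclave = [(4, 4), (6, 4), (6, 6), (4, 6)]

candidate = in_bbox(5, 5, outer)            # cheap test says "maybe"
actual = within(5, 5, outer, [enclave])     # exact test says "no"
elsewhere = within(2, 2, outer, [enclave])  # a point in the country proper
```

The point at (5, 5) passes the cheap bounding-box filter but fails the exact test, which is the elimination that step 4 calls for.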
33
+
34
+
The core spatial operations allow us to segment and .
35
+
The key new skill as map/reduce coaches is to do so correctly and efficiently: without expensive processing, without requiring context that might reside on some other machine, and without an explosion of midstream data, or proliferation of not-actually-relevant objects to consider, or an infection of indeterminately duplicate output records.
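
To make the duplicate-output concern concrete, here is a small Python sketch of one common remedy, sometimes called the reference-point method. It is offered as an illustration, not as the approach this chapter necessarily takes. Every name, the tile size, and the rectangles are hypothetical. Each tile examines the pairs it received, but reports a pair only if the tile contains the lower-left corner of the pair's overlap.

```python
import math
from collections import defaultdict
from itertools import combinations

TILE = 10.0  # tile edge, in the same units as the coordinates (hypothetical)

def tiles_for(box):
    """All (x, y) tile indices overlapped by a (west, south, east, north) box."""
    west, south, east, north = box
    return [(x, y)
            for x in range(math.floor(west / TILE), math.floor(east / TILE) + 1)
            for y in range(math.floor(south / TILE), math.floor(north / TILE) + 1)]

def intersection(a, b):
    """Overlap rectangle of two boxes, or None if they do not overlap."""
    west, south = max(a[0], b[0]), max(a[1], b[1])
    east, north = min(a[2], b[2]), min(a[3], b[3])
    return (west, south, east, north) if west <= east and south <= north else None

boxes = {
    "a": (8.0, 2.0, 23.0, 6.0),
    "b": (5.0, 4.0, 27.0, 9.0),
    "c": (31.0, 1.0, 33.0, 3.0),
}

# "Map" side: scatter each box to every tile it overlaps.
by_tile = defaultdict(list)
for name, box in boxes.items():
    for tile in tiles_for(box):
        by_tile[tile].append(name)

# "Reduce" side: each tile sees only its own bag of boxes.
naive, deduped = [], []
for tile, names in by_tile.items():
    for p, q in combinations(sorted(names), 2):
        overlap = intersection(boxes[p], boxes[q])
        if overlap is None:
            continue
        naive.append((p, q))  # reported once per shared tile
        owner = (math.floor(overlap[0] / TILE), math.floor(overlap[1] / TILE))
        if owner == tile:
            deduped.append((p, q))  # reported only by the owning tile
```

Boxes `a` and `b` share three tiles, so the naive output lists the pair three times while the deduplicated output lists it once. No tile needs to know what any other tile emitted, because the owner is computed from the pair itself.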
36
+
37
+
Those strategies aren't unique to the spatial case, and so what we're really trying to do in this chapter is equip you to deal with situations where determining which objects should be related in context is complex and can't be done locally. For example, one way to find pairs of similar documents involves deriving each document's most prominent keywords, then matching each document to all others that share a significant fraction of its keywords. In the spatial analytics case, our cascading love triangles from London to Paris to Lyon to Milan (REF (to E&C preamble)) threatened to crowd the whole world into a single dance hall. The document similarity case presents the same problem. A document mentioning Paris, Lyon, and Milan is a candidate match for one mentioning London+Paris+Lyon and for one mentioning Lyon+Milan+Rome -- but London+Paris+Lyon and Lyon+Milan+Rome don't need to be compared. We want to ensure the necessary pairings are considered, without allowing the chain of candidate matches from London+Paris+Lyon, through Lyon+Milan+Rome, on over to Busan+Osaka+Tokyo to land on the same library table. It's the same problem in a different guise, and the essential intuition you build here will carry over.
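
The document example can be sketched in a few lines of Python. This is an illustration under an assumption of our own: "a significant fraction" is taken to mean two shared keywords out of three, so each document is grouped under every pair of its keywords. The document ids are made up.

```python
from collections import defaultdict
from itertools import combinations

docs = {
    "doc_lpl": {"London", "Paris", "Lyon"},
    "doc_plm": {"Paris", "Lyon", "Milan"},
    "doc_lmr": {"Lyon", "Milan", "Rome"},
    "doc_bot": {"Busan", "Osaka", "Tokyo"},
}

# "Map" side: emit each document once per pair of its keywords, so two
# documents meet only if they share at least two keywords.
by_key = defaultdict(set)
for doc_id, keywords in docs.items():
    for pair in combinations(sorted(keywords), 2):
        by_key[pair].add(doc_id)

# "Reduce" side: documents that meet under the same key become candidates.
# Collecting into a set drops pairs that meet under more than one key.
candidates = set()
for doc_ids in by_key.values():
    candidates.update(combinations(sorted(doc_ids), 2))
```

The Paris+Lyon+Milan document is paired with each of its neighbors, but London+Paris+Lyon and Lyon+Milan+Rome share only one keyword and never land in the same group, and the Busan+Osaka+Tokyo document meets nobody.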
38
+
39
+
40
+
===== What We'll Cover
41
+
13
42
We'll start by demonstrating spatial aggregations: counting the number of UFO sightings per area, or smoothing pointwise elevation estimates onto a grid.
14
43
It's one of the frontline tools of spatial analysis, a great way to summarize a large data set, and one of the first things you'll do to Know Thy Data. It's also a direct application of things you've already learned, giving us a chance to introduce some terminology and necessary details. Next is a gallery tour of the basic spatial operations: how to find a shape's area or bounding box, its intersection with another shape, and so forth. We won't spend much of your time here, as these are easy enough to apply once you know their names.
15
44
@@ -32,6 +61,13 @@ But when you want to map against the nearest object -- which might be a three-mi
32
61
Let's start off in the best way possible: with a tool for turning lots of data into manageable insight.
33
62
34
63
64
+
image::images/11a-france-uk-calais.png[The territories of France and the UK are close by in Europe]
65
+
image::images/11a-france-uk-caribbean.png[...and also in the Caribbean]
66
+
67
+
image::images/11a-south_africa-lesotho.png[South Africa contains Lesotho... Politics trumps Topology]
==== Smoothing Pointwise Data Locally (Spatial Aggregation of Points)
3
+
4
+
5
+
We will start, as we always do, by applying patterns that turn Big Data into Much Less Data. In particular,
6
+
A great tool for visualizing a large spatial data set
7
+
8
+
9
+
* You want to "wash out" everything but the spatial variation -- even though the data was gathered for each
10
+
* Point measurement of effect with local extent -- for example, the temperature measured at a weather station is understood to be representative of the weather for several surrounding miles.
11
+
*
12
+
*
13
+
* data reduction, especially for a heatmap visualization;
14
+
* extracting a continuous measurement from a pointwise sample;
15
+
* providing a common basis for comparison of multiple datasets;
16
+
* smoothing out spatial variation;
17
+
* for all the other reasons you aggregate groups of related values in context
18
+
* You have sampled data at points in order to estimate something with spatial extent. The weather dataset is an example:
19
+
* Data that manifests at a single point
20
+
represents a process with
21
+
For example, the number of airline passengers in and out of the major airport
22
+
are travelling to and from local destinations
23
+
* Smoothing pointwise data
24
+
into a
25
+
easier to compare or manage
26
+
* continuous approximation
27
+
represents just the variation due to spatial
28
+
variables
29
+
30
+
The straightforward approach we'll take is to divide the world up into a grid of tiles and map the position of each point onto the unique grid tile it occupies. We can then group on each tile.
31
+
32
+
Area of a spherical segment is 2*pi*R*h --
33
+
so for lat from equator to 60
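
As a worked sketch of the formula in this note (treating the Earth as a sphere of radius R, and reading h as the height of the band between latitudes phi_1 and phi_2, which is our interpretation of the note):

```latex
\begin{align*}
A &= 2\pi R h, \qquad h = R\,(\sin\varphi_2 - \sin\varphi_1) \\
A_{0^\circ \to 60^\circ} &= 2\pi R^2 \sin 60^\circ \approx 0.866 \cdot 2\pi R^2
\end{align*}
```

Since a hemisphere has area 2*pi*R^2, the band from the equator to 60 degrees holds about 87% of the hemisphere's area while covering only two-thirds of its latitude range. The same cosine falloff applies to individual tiles: a bin of fixed size in degrees at 60 degrees latitude covers roughly half the area of the same bin at the equator, which matters when comparing counts per bin across latitudes.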
34
+
35
+
------
-- binsz is the number of bins per degree, so 2.0 yields half-degree bins
%default binsz 2.0
-- place into half-degree bins -- ~ 120x50 cells for US
gridded = FOREACH sightings GENERATE
    FLOOR(lng * $binsz) / $binsz AS bin_x,
    FLOOR(lat * $binsz) / $binsz AS bin_y;
-- number density
grid_cts = FOREACH (GROUP gridded BY (bin_x, bin_y))
  GENERATE
    group.bin_x, group.bin_y,
    COUNT_STAR(gridded) AS ct;
------
47
+
48
+
* US: (-125, 24) to (-66, 50) (exact bounding box: -124.7625, 24.5210, -66.9326, 49.3845) -- about 60 x 26 degrees
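
The grid size follows from simple arithmetic on that bounding box. A quick Python check, where `bins_per_degree` plays the same role as `binsz` in the Pig script and `bin_index` is a hypothetical helper:

```python
import math

# Bounding box quoted above for the US: west, south, east, north.
west, south, east, north = -124.7625, 24.5210, -66.9326, 49.3845
bins_per_degree = 2.0  # same role as binsz in the Pig script

def bin_index(coord):
    """Integer index of the half-degree bin holding a coordinate."""
    return math.floor(coord * bins_per_degree)

# Count the bins from the westmost to the eastmost, and south to north.
n_cols = bin_index(east) - bin_index(west) + 1
n_rows = bin_index(north) - bin_index(south) + 1
```

This gives 117 columns by 50 rows, in line with the "~ 120x50 cells" estimate in the script's comment.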
49
+
50
+
==== Creating a Spatial Density Map
51
+
52
+
Map points to quad cells, plot number density of airports as a heat map
53
+
54
+
Then geonames places -- show lakes and streams (or something nature-y) vs something urban-y
55
+
56
+
(just call out that rollup, summing trick, or group-decorate-flatten would work: do not pursue)
57
+
58
+
Do that again, but for a variable: airport flight volume -- researching
59
+
epidemiology
60
+
61
+
// FAA flight data http://www.faa.gov/airports/planning_capacity/passenger_allcargo_stats/passenger/media/cy07_primary_np_comm.pdf
62
+
63
+
We can plot the number of air flights handled by every airport
64
+
65
+
------
-- binsz is the number of bins per degree, so 2.0 yields half-degree bins
%default binsz 2.0
-- place into half-degree bins -- ~ 120x50 cells for US
-- (airports is assumed to be a relation with lng, lat and n_flights fields)
gridded = FOREACH airports GENERATE
    FLOOR(lng * $binsz) / $binsz AS bin_x,
    FLOOR(lat * $binsz) / $binsz AS bin_y,
    n_flights;
-- number density and total flight volume per bin
grid_cts = FOREACH (GROUP gridded BY (bin_x, bin_y))
  GENERATE
    group.bin_x, group.bin_y,
    COUNT_STAR(gridded) AS ct,
    SUM(gridded.n_flights) AS tot_flights;
------
79
+
80
+
An epidemiologist or transportation analyst interested in the large-scale flux of people throughout the global transportation network could start from a summary like this one.
81
+
82
+
===== Pattern Recap: Spatial Aggregation of Points
83
+
84
+
* _Generic Example_ -- group on tile cell, then apply the appropriate aggregation function
85
+
* _When You'll Use It_ -- as mentioned above: summarizing data; converting point samples into a continuous value; smoothing out spatial variation; reassigning spatial data to grid-aligned regions
86
+
* _Exercises_ --
87
+
* _Important to Know_ --
88
+
- A https://en.wikipedia.org/wiki/Dot_distribution_map[Dot Distribution Map] is in some sense the counterpart to a spatial average -- turning data over a region into data at synthesized points
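
The generic pattern in this recap (group on tile cell, then apply an aggregation function) can be sketched in plain Python. The helper `aggregate_by_tile`, the one-degree tile size, and the elevation samples are all invented for illustration. The aggregation function is passed in, so the same grouping serves for counting, summing, or averaging.

```python
import math
from collections import defaultdict
from statistics import mean

def aggregate_by_tile(records, agg, tile_deg=1.0):
    """Group (lng, lat, value) records by tile, then apply agg to each
    tile's list of values."""
    groups = defaultdict(list)
    for lng, lat, value in records:
        key = (math.floor(lng / tile_deg), math.floor(lat / tile_deg))
        groups[key].append(value)
    return {key: agg(values) for key, values in groups.items()}

# Made-up elevation samples as (lng, lat, meters).
samples = [(10.2, 46.1, 1200.0), (10.7, 46.9, 1400.0), (11.3, 46.5, 900.0)]

smoothed = aggregate_by_tile(samples, mean)  # pointwise samples -> grid average
counts = aggregate_by_tile(samples, len)     # number density
totals = aggregate_by_tile(samples, sum)     # total of a variable per tile
```

Swapping `mean` for `len` or `sum` mirrors the difference between the `COUNT_STAR` and `SUM` columns in the Pig scripts above.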