Commit 24f39b8

author
Philip (flip) Kromer
committed
intro improved a bit
1 parent 9446ae7 commit 24f39b8

1 file changed

+10
-16
lines changed

11a-geodata-intro.asciidoc

Lines changed: 10 additions & 16 deletions
@@ -36,31 +36,24 @@ The key new skill as map/reduce coaches is to do so correctly and efficiently: w
3636

3737
Those strategies aren't unique to the spatial case, and so what we're really trying to do in this chapter is equip you to deal with situations where determining which objects should be related in context is complex and can't be done locally. For example, one way to find pairs of similar documents involves deriving each document's most prominent keywords, then matching each document to all others that share a significant fraction of its keywords. In the spatial analytics case, our cascading love triangles from London to Paris to Lyon to Milan (REF (to E&C preamble)) threatened to crowd the whole world into a single dance hall. The document similarity case presents the same problem. A document mentioning Paris, Lyon, and Milan is a candidate match for one mentioning London+Paris+Lyon and for one mentioning Lyon+Milan+Rome -- but London+Paris+Lyon and Lyon+Milan+Rome don't need to be compared. We want to ensure the necessary pairings are considered, without allowing the chain of candidate matches from London+Paris+Lyon, through Lyon+Milan+Rome, on over to Busan+Osaka+Tokyo to land on the same library table. It's the same problem in a different guise, and the essential intuition you build here will carry over.
3838

39-
4039
===== What We'll Cover
4140

42-
We'll start by demonstrating spatial aggregations: counting the number of UFO sightings per area, or smoothing pointwise elevation estimates onto a grid.
43-
It's one of the frontline tools of spatial analysis, a great way to summarize a large data set, and one of the first things you'll do to Know Thy Data. It's also a direct application of things you've already learned, giving us a chance to introduce some terminology and necessary details. Next is a gallery tour of the basic spatial operations: how to find a shape's area or bounding box, its intersection with another shape, and so forth. We won't spend much of your time here, as these are easy enough to apply once you know their names.
44-
45-
Spatially aggregating points onto a grid is straightforward because each point only provides relevant context to the single grid cell it occupies.
46-
47-
Gather together each object with every relevant
48-
Nearby shape
49-
Without ever accumulating a lopsided share of objects into the same group.
41+
We'll start by demonstrating spatial aggregations on points: for example, counting the number of UFO sightings per area over a regular grid. This type of aggregation is a frontline tool of spatial analysis, a great way to summarize a large data set, and one of the first things you'll do to Know Thy Data. It draws on methods you've already learned, giving us a chance to introduce some terminology and necessary details. Then we'll go from spatial grouping of a single dataset to demonstrate a point-wise spatial join: matching points in a table with all nearby points in another table.
5042

51-
Spatial aggregations on regions -- e.g. smoothing county-by-county data onto a uniform grid -- demands a bit more subtlety than aggregations of points. In places where multiple regions appear you must weight their contributions correctly, and in places where only one region is present you'd like to avoid extra work. We handle that using a quad-tree tiling (aka "quadtiles"), a superbly elegant tool for partitioning spatially-nearby objects onto common reducers with minimal waste.
43+
Next is a gallery tour of the basic spatial operations: how to find a shape's area or bounding box, its intersection with another shape, and so forth. We won't spend much of your time here, but it's worth having something to refer to. The real fun comes as the notion of "nearby" becomes less and less predictable in advance.
5244

53-
The real fun comes as the notion of "nearby" becomes less and less predictable in advance.
45+
Spatial aggregations on regions -- e.g. smoothing country-by-country crop production data onto a uniform grid -- demand more subtlety than aggregations of points. In places where multiple regions appear, you must weight their contributions correctly, and in places where only one region is present you'd like to avoid extra work. We handle that using a quad-tree tiling (aka "quadtiles"), a superbly elegant tool for partitioning spatially-nearby objects onto common reducers with minimal waste.
5446

55-
Spatial join of shapes
47+
The quadtile scheme not only helps to partition the data efficiently but also orders it so that spatially-nearby objects are generally nearby in quadtile order. That lets us perform a spatial join of points with _regions_ using nothing more than Hadoop's native sort and a low-memory-overhead data structure. This is the key material in the chapter, and we'll step through it in detail.
5648

49+
Lastly, we'll demonstrate how to handle the case where it's difficult in advance to even know how to spatially partition data.
50+
You see, one way to determine which records are spatially relevant -- the one all the operations to that point in the chapter will have used -- is to define a fixed distance and only relate objects within that nearby-ness threshold.
5751
When you're matching shapes with objects less than a certain distance away,
5852
You can set an upper bound in advance on what "nearby" means: the specified distance.
59-
But when you want to map against the nearest object -- which might be a three-minute walk or might be a three-day sail -- things get more complicated. There's a wonderful trick for doing these kinds of "nearest object" queries without overwhelming your cluster. We'll use it to combine our record of every baseball game played against the historical weather data, and learn an important truth about truth and error along the way.
53+
But when you want to map against not "_any nearby_" objects but against "_the nearest_" object -- which might be a three-minute walk or might be a three-day sail -- things get more complicated. There's a wonderful trick for doing these kinds of "nearest object" queries without overwhelming your cluster. We'll use it to combine our record of every baseball game played against the historical weather data, and learn an important truth about truth and error along the way.
6054

6155
Let's start off in the best way possible: with a tool for turning lots of data into manageable insight.
6256

63-
6457
.The territories of France and the UK are close by in Europe
6558
image::images/11a-france-uk-calais.png[height=120]
6659

@@ -70,15 +63,16 @@ image::images/11a-france-uk-caribbean.png[height=120]
7063
.South Africa contains Lesotho: Politics trumps Topology
7164
image::images/11a-south_africa-lesotho.png[height=150]
7265

73-
74-
7566
// Features of Features
7667
// [NOTE]
7768
// ===============================
7869
// The term "feature" is somewhat muddied -- to a geographer, "feature" indicates a _thing_ being described (places, regions, paths are all geographic features). In the machine learning literature, "feature" describes a potentially-significant _attribute_ of a data element (manufacturer, top speed and weight are features of a car). Since we're here as data scientists dabbling in geography, we'll reserve the term "feature" for only its machine learning sense.
7970
// ===============================
8071

8172

73+
// Spatially aggregating points onto a grid is straightforward because each point only provides relevant context to the single grid cell it occupies.
74+
// Gather together each object with every relevant Nearby shape
75+
// Without ever accumulating a lopsided share of objects into the same group.
8276

8377
// * Geometry is hard to do _right_
8478
// * Pretending the bumpy kinda-ellipsoid is a simple rectangle.
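
The point aggregation described under "What We'll Cover" (counting UFO sightings per cell of a regular grid) can be sketched in a few lines. This is only an illustration, not the chapter's own code: the function names `grid_cell` and `count_per_cell`, the one-degree cell size, and the sample coordinates are all assumptions made up for the example. The useful property is that each point belongs to exactly one cell, so in map/reduce terms the mapper would emit `(cell, 1)` and the reducer would sum.

```python
import math
from collections import Counter

def grid_cell(lng, lat, cell_deg=1.0):
    """Return the (column, row) index of the grid cell containing a point.

    cell_deg is a hypothetical cell size in degrees; math.floor keeps
    negative coordinates in the correct cell.
    """
    return (math.floor(lng / cell_deg), math.floor(lat / cell_deg))

def count_per_cell(points, cell_deg=1.0):
    """Count points per grid cell (the 'group by cell, then count' step)."""
    return Counter(grid_cell(lng, lat, cell_deg) for lng, lat in points)

# Made-up (longitude, latitude) pairs standing in for sighting records.
sightings = [(-97.7, 30.3), (-97.2, 30.9), (-104.5, 33.4)]
counts = count_per_cell(sightings)
```

Here the first two points share a cell and the third lands in its own, so no point ever needs context from a neighboring cell.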

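
The claim that quadtile order keeps spatially-nearby objects generally nearby can be made concrete with a quadkey: a string of base-4 digits, one per zoom level, built by interleaving the bits of a tile's column and row. The sketch below assumes the common Web Mercator tiling convention; the diff does not say which projection or key format the chapter uses, so `tile_xy` and `quadkey` are hypothetical helpers for illustration only.

```python
import math

def tile_xy(lng, lat, zoom):
    """Map a longitude/latitude to tile (x, y) indices at a zoom level.

    Assumes Web Mercator tiling; this projection is an assumption,
    not something stated in the chapter.
    """
    n = 1 << zoom  # tiles per side at this zoom level
    x = int((lng + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the map edge stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)

def quadkey(x, y, zoom):
    """Interleave the bits of x and y, most significant first."""
    digits = []
    for i in range(zoom, 0, -1):
        mask = 1 << (i - 1)
        digit = (1 if x & mask else 0) + (2 if y & mask else 0)
        digits.append(str(digit))
    return "".join(digits)

parent = quadkey(1, 2, 2)  # a tile at zoom 2
child = quadkey(3, 5, 3)   # one of its four children at zoom 3
```

A child tile's key always begins with its parent's key, so sorting records by quadkey (a plain string sort) places everything inside a tile in one contiguous run. That is the property that lets a native sort do most of the partitioning work.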