Skip to content

Commit 9446ae7

Browse files
author
Philip (flip) Kromer
committed
moving out old prose
1 parent 097cefc commit 9446ae7

File tree

4 files changed

+248
-279
lines changed

4 files changed

+248
-279
lines changed

11b-spatial_aggregation-points.asciidoc

Lines changed: 13 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -182,4 +182,16 @@ There are still other details to attend to -- the box could cross over the antim
182182
image::images/11-circle_of_constant_distance.png[Min/Max Longitudes are not at the same latitude as the center]
183183

184184

185-
===
185+
186+
// +---------+---------+
187+
// | B |
188+
// | |
189+
// | |
190+
// | |
191+
// + A +
192+
// | | C
193+
// | |
194+
// | |
195+
// | | B is nearby A; C is not. Sorry, C.
196+
// +---------+---------+
197+
// |- 10 km -|

11d-quadtiles.asciidoc

Lines changed: 3 additions & 239 deletions
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,8 @@
1-
21
==== The Quadtile Grid System ====
32

4-
We'll start by adopting the simple, flat Mercator projection -- directly map longitude and latitude to (X,Y). This makes geographers cringe, because of its severe distortion at the poles, but its computational benefits are worth it. footnote:[Two guides for which map projection to choose: http://www.radicalcartography.net/?projectionref http://xkcd.com/977/ . As you proceed to finer and finer zoom levels the projection distortion becomes less and less relevant, so the simplicity of Mercator or Equirectangular are appealing.]
3+
// We'll start by adopting the simple, flat Mercator projection -- directly map longitude and latitude to (X,Y). This makes geographers cringe, because of its severe distortion at the poles, but its computational benefits are worth it. footnote:[Two guides for which map projection to choose: http://www.radicalcartography.net/?projectionref http://xkcd.com/977/ . As you proceed to finer and finer zoom levels the projection distortion becomes less and less relevant, so the simplicity of Mercator or Equirectangular are appealing.]
54

6-
Now divide the world into four and make a Z pattern across them:
5+
The quadtile trick is to the world into four and make a Z pattern across them:
76

87
Within each of those, make a <<z_path_of_quadtiles, Z again>>:
98

@@ -21,131 +20,6 @@ This is a 1-d index into a 2-d space! What's more, nearby points in space are ty
2120

2221
Note: you'll sometimes see people refer to quadtile coordinates as `X/Y/Z` or `Z/X/Y`; the 'Z' here refers to zoom level, not a traditional third coordinate.
2322

24-
==== Patterns in UFO Sightings ====
25-
26-
Let's put Hadoop into practice for something really important: understanding where a likely alien invasion will take place. The National UFO Reporting Center has compiled a dataset of 60,000+ documented UFO sightings, with metadata. We can combine that with the 7 million labelled points of interest in the Geonames dataset: airports and zoos, capes to craters, schools, churches and more.
27-
28-
Going in to this, I predict that UFO sightings will generally follow the population distribution (because you need people around to see them) but that sightings in cities will be under-represented per capita. I also suspect UFO sightings will be more likely near airports and military bases, and in the southwestern US. We will restrict attention only to the continental US; coverage of both datasets is spotty elsewhere, which will contaminate our results.
29-
30-
Looking through some weather reports, visibilities of ten to fifteen kilometers (6-10 miles) are a reasonable midrange value; let's use that distance to mean "nearby". Given this necessarily-fuzzy boundary, let's simplify matters further by saying two objects are nearby if one point lies within the 20-km-per-side bounding box centered on the other:
31-
32-
33-
+---------+---------+
34-
| B |
35-
| |
36-
| |
37-
| |
38-
+ A +
39-
| | C
40-
| |
41-
| |
42-
| | B is nearby A; C is not. Sorry, C.
43-
+---------+---------+
44-
|- 10 km -|
45-
46-
==== Mapper: dispatch objects to rendezvous at quadtiles ====
47-
48-
What we will do is partition the world by quadtile, and ensure that each candidate pair of points arrives at the same quadtile.
49-
50-
Our mappers will send the highly-numerous geonames points directly to their quadtile, where they will wait individually. But we can't send each UFO sighting only to the quadtile it sits on: it might be nearby a place on a neighboring tile.
51-
52-
If the quadtiles are always larger than our nearbyness bounding box, then it's enough to just look at each of the four corners of our bounding box; all candidate points for nearbyness must live on the 1-4 quadtiles those corners touch. Consulting the geodata ready reference (TODO: ref) later in the book, zoom level 11 gives a grid size of 13-20km over the continental US, so it will serve.
53-
54-
So for UFO points, we will use the `bbox_for_radius` helper to get the left-top and right-bottom points, convert each to quadtile id's, and emit the unique 1-4 tiles the bounding box covers.
55-
56-
Example values:
57-
58-
longitude latitude left top right bottom nw_tile_id se_tile_id
59-
... ...
60-
... ...
61-
62-
Data is cheap and code is expensive, so for these 60,000 points we'll just serialize out the bounding box coordinates with each record rather than recalculate them in the reducer. We'll discard most of the UFO sightings fields, but during development let's keep the location and time fields in so we can spot-check results.
63-
64-
// Mapper output:
65-
66-
==== Reducer: combine objects on each quadtile ====
67-
68-
The reducer is now fairly simple. Each quadtile will have a handful of UFO sightings, and a potentially large number of geonames places to test for nearbyness. The nearbyness test is straightforward:
69-
70-
# from wukong/geo helpers
71-
72-
class BoundingBox
73-
def contains?(obj)
74-
( (obj.longitude >= left) && (obj.latitude <= top) &&
75-
(obj.longitude <= right) && (obj.latitude >= btm)
76-
end
77-
end
78-
79-
# nearby_ufos.rb
80-
81-
class NearbyReducer
82-
83-
def process_group(group)
84-
# gather up all the sightings
85-
sightings = []
86-
group.gather(UfoSighting) do |sighting|
87-
sightings << sighting
88-
end
89-
# the remaining records are places
90-
group.each do |place|
91-
sighted = false
92-
sightings.each do |sighting|
93-
if sighting.contains?(place)
94-
sighted = true
95-
yield combined_record(place, sighting)
96-
end
97-
end
98-
yield unsighted_record(place) if not sighted
99-
end
100-
end
101-
102-
def combined_record(place, sighting)
103-
(place.to_tuple + [1] + sighting.to_tuple)
104-
end
105-
def unsighted_record(place)
106-
place.to_tuple + [0]
107-
end
108-
end
109-
110-
For now I'm emitting the full place and sighting record, so we can see what's going on. In a moment we will change the `combined_record` method to output a more disciplined set of fields.
111-
112-
Output data:
113-
114-
...
115-
116-
==== Comparing Distributions ====
117-
118-
We now have a set of `[place, sighting]` pairs, and we want to understand how the distribution of coincidences compares to the background distribution of places.
119-
120-
(TODO: don't like the way I'm currently handling places near multiple sightings)
121-
122-
That is, we will compare the following quantities:
123-
124-
count of sightings
125-
count of features
126-
for each feature type, count of records
127-
for each feature type, count of records near a sighting
128-
129-
The dataset at this point is small enough to do this locally, in R or equivalent; but if you're playing along at work your dataset might not be. So let's use pig.
130-
131-
place_sightings = LOAD "..." AS (...);
132-
133-
features = GROUP place_sightings BY feature;
134-
135-
feature_stats = FOREACH features {
136-
sighted = FILTER place_sightings BY sighted;
137-
GENERATE features.feature_code,
138-
COUNT(sighted) AS sighted_count,
139-
COUNT_STAR(sighted) AS total_count
140-
;
141-
};
142-
143-
STORE feature_stats INTO '...';
144-
145-
results:
146-
147-
... TODO move results over from cluster ...
148-
14923
[[quadkey]]
15024
=== Quadtile Practicalities ===
15125

@@ -187,7 +61,6 @@ It's straightforward to calculate tile_x indices from the longitude (because all
18761

18862
Finding the Y tile index requires a slightly more complicated formula:
18963

190-
19164
This makes each grid cell be an increasingly better locally-flat approximation to the earth's surface, palliating the geographers anger at our clumsy map projection.
19265

19366
In code:
@@ -219,17 +92,9 @@ In code:
21992

22093
[NOTE]
22194
=========================
222-
Take care to put coordinates in the order "longitude, latitude", maintaining consistency with the (X, Y) convention for regular points. Natural english idiom switches their order, a pernicious source of error -- but the convention in http://www.geojson.org/geojson-spec.html#positions[geographic systems] is unambiguously to use `x, y, z` ordering. Also, don't abbreviate longitude as `long` -- it's a keyword in pig and other languages. I like `lng`.
95+
Take care to put coordinates in the order "longitude, latitude", maintaining consistency with the (X, Y) convention for regular points. Natural english idiom switches their order, a pernicious source of error -- but the convention in http://www.geojson.org/geojson-spec.html#positions[geographic systems] is unambiguously to use `x, y, z` ordering. Also, don't abbreviate longitude as `long` -- it's a keyword in pig and other languages. We like `lng`.
22396
=========================
22497

225-
==== Exploration
226-
227-
* _Exemplars_
228-
- Tokyo
229-
- San Francisco
230-
- The Posse East Bar in Austin, TX footnote:[briefly featured in the Clash's Rock the Casbah Video and where much of this book was written]
231-
232-
23398
==== Interesting quadtile properties ====
23499

235100
* The quadkey's length is its zoom level.
@@ -262,38 +127,6 @@ Take care to put coordinates in the order "longitude, latitude", maintaining con
262127

263128
A 64-bit quadkey -- corresponding to zoom level 32 -- has an accuracty of better than 1 cm over the entire globe. In some intensive database installs, rather than storing longitude and latitude separately as floating-point numbers, consider storing either the interleaved packed quadkey, or the individual 32-bit tile ids as your indexed value. The performance impact for Hadoop is probably not worth it, but for a database schema it may be.
264129

265-
===== Quadkey to and from Longitude/Latitude =====
266-
267-
# converts from even/odd state of tile x and tile y to quadkey. NOTE: bit order means y, x
268-
BIT_TO_QUADKEY = { [false, false] => "0", [false, true] => "1", [true, false] => "2", [true, true] => "3", }
269-
# converts from quadkey char to bits. NOTE: bit order means y, x
270-
QUADKEY_TO_BIT = { "0" => [0,0], "1" => [0,1], "2" => [1,0], "3" => [1,1]}
271-
272-
# Convert from tile x,y into a quadkey at a specified zoom level
273-
def tile_xy_zl_to_quadkey(tile_x, tile_y, zl)
274-
quadkey_chars = []
275-
tx = tile_x.to_i
276-
ty = tile_y.to_i
277-
zl.times do
278-
quadkey_chars.push BIT_TO_QUADKEY[[ty.odd?, tx.odd?]] # bit order y,x
279-
tx >>= 1 ; ty >>= 1
280-
end
281-
quadkey_chars.join.reverse
282-
end
283-
284-
# Convert a quadkey into tile x,y coordinates and level
285-
def quadkey_to_tile_xy_zl(quadkey)
286-
raise ArgumentError, "Quadkey must contain only the characters 0, 1, 2 or 3: #{quadkey}!" unless quadkey =~ /\A[0-3]*\z/
287-
zl = quadkey.to_s.length
288-
tx = 0 ; ty = 0
289-
quadkey.chars.each do |char|
290-
ybit, xbit = QUADKEY_TO_BIT[char] # bit order y, x
291-
tx = (tx << 1) + xbit
292-
ty = (ty << 1) + ybit
293-
end
294-
[tx, ty, zl]
295-
end
296-
297130
=== Quadtile Ready Reference ===
298131

299132
image::images/quadkey_ref-zoom_levels.png[Quadtile properties and data storage sizes by zoom level]
@@ -305,72 +138,3 @@ image::images/quadkey_ref-world_cities.png[Quadtile Properties for major world c
305138
The (ref table) gives the full coordinates at every zoom level for our exemplar set.
306139

307140
image::images/quadkey_ref-full_props-by_zl.png[Coordinates at every zoom level for some exemplars]
308-
309-
310-
==== Working with paths ====
311-
312-
The _smallest tile that fully encloses a set of points_ is given by the tile with the largest common quadtile prefix. For example, the University of Texas (quad `0231_3012_0331_1131`) and my office (quad `0231_3012_0331_1211`) are covered by the tile `0231_3012_0331_1`.
313-
314-
image::images/fu05-geographic-path-hq-to-ut.png[Path from Chimp HQ to UT campus]
315-
316-
When points cross major tile boundaries, the result is less pretty. Austin's airport (quad `0231301212221213`) shares only the zoom-level 8 tile `02313012`:
317-
318-
image::images/fu05-geographic-path-hq-to-airport.png[Path from Chimp HQ to AUS Airport]
319-
320-
==== Calculating Distances ====
321-
322-
To find the distance between two points on the globe, we use the Haversine formula
323-
324-
325-
in code:
326-
327-
# Return the haversine distance in meters between two points
328-
def haversine_distance(left, top, right, btm)
329-
delta_lng = (right - left).abs.to_radians
330-
delta_lat = (btm - top ).abs.to_radians
331-
top_rad = top.to_radians
332-
btm_rad = btm.to_radians
333-
334-
aa = (Math.sin(delta_lat / 2.0))**2 + Math.cos(top_rad) * Math.cos(btm_rad) * (Math.sin(delta_lng / 2.0))**2
335-
cc = 2.0 * Math.atan2(Math.sqrt(aa), Math.sqrt(1.0 - aa))
336-
cc * EARTH_RADIUS
337-
end
338-
339-
# Return the haversine midpoint in meters between two points
340-
def haversine_midpoint(left, top, right, btm)
341-
cos_btm = Math.cos(btm.to_radians)
342-
cos_top = Math.cos(top.to_radians)
343-
bearing_x = cos_btm * Math.cos((right - left).to_radians)
344-
bearing_y = cos_btm * Math.sin((right - left).to_radians)
345-
mid_lat = Math.atan2(
346-
(Math.sin(top.to_radians) + Math.sin(btm.to_radians)),
347-
(Math.sqrt((cos_top + bearing_x)**2 + bearing_y**2)))
348-
mid_lng = left.to_radians + Math.atan2(bearing_y, (cos_top + bearing_x))
349-
[mid_lng.to_degrees, mid_lat.to_degrees]
350-
end
351-
352-
# From a given point, calculate the point directly north a specified distance
353-
def point_north(longitude, latitude, distance)
354-
north_lat = (latitude.to_radians + (distance.to_f / EARTH_RADIUS)).to_degrees
355-
[longitude, north_lat]
356-
end
357-
358-
# From a given point, calculate the change in degrees directly east a given distance
359-
def point_east(longitude, latitude, distance)
360-
radius = EARTH_RADIUS * Math.sin(((Math::PI / 2.0) - latitude.to_radians.abs))
361-
east_lng = (longitude.to_radians + (distance.to_f / radius)).to_degrees
362-
[east_lng, latitude]
363-
end
364-
365-
===== Grid Sizes and Sample Preparation =====
366-
367-
Always include as a mountweazel some places you're familiar with. It's much easier for me to think in terms of the distance from my house to downtown, or to Dallas, or to New York than it is to think in terms of zoom level 14 or 7 or 4
368-
369-
==== Distributing Boundaries and Regions to Grid Cells ====
370-
371-
(TODO: Section under construction)
372-
373-
This section will show how to
374-
375-
* efficiently segment region polygons (county boundaries, watershed regions, etc) into grid cells
376-
* store data pertaining to such regions in a grid-cell form: for example, pivoting a population-by-county table into a population-of-each-overlapping-county record on each quadtile.

0 commit comments

Comments
 (0)