
Commit a577037

Philip (flip) Kromer committed: working through editorial feedback
1 parent e308ee1 commit a577037

14 files changed: +462, -144 lines

01-intro.asciidoc

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+why Hadoop is a breakthrough tool and examples of how you can use it to transform, simplify, contextualize, and organize data.
+
+* distributes the data
+* context (group)
+* matching (cogroup / join)
+*
+
+* coordinates to grid cells
+* group on location
+* count articles
+* wordbag
+* join wordbags to coordinates
+* sum counts
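The outline above compresses a whole pipeline into bullet points. As a hedged sketch of how those steps might fit together in Pig -- the relation names, fields, and the `GridCell` helper are illustrative assumptions, not the chapter's actual code:

------
-- illustrative sketch only: GridCell is a hypothetical UDF that snaps a
-- longitude/latitude pair to a coarse grid cell identifier
articles = LOAD 'articles' AS (article_id:chararray, lng:double, lat:double);
wordbags = LOAD 'wordbags' AS (article_id:chararray, word:chararray, n_uses:long);

-- coordinates to grid cells
gridded  = FOREACH articles GENERATE article_id, GridCell(lng, lat) AS cell;

-- group on location, count articles
cell_cts = FOREACH (GROUP gridded BY cell)
           GENERATE group AS cell, COUNT_STAR(gridded) AS n_articles;

-- join wordbags to coordinates, then sum counts within each cell
joined   = JOIN wordbags BY article_id, gridded BY article_id;
word_cts = FOREACH (GROUP joined BY (gridded::cell, wordbags::word))
           GENERATE FLATTEN(group) AS (cell, word),
                    SUM(joined.wordbags::n_uses) AS tot_uses;
------

The point of the outline, and of the sketch, is that every step is a plain group, join, or aggregation; Hadoop's contribution is to distribute the data so those steps run at scale.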

02-feedback_and_response.asciidoc

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+==== Introduction Structure
+
+==== Tell readers what the point of this is before you dive into the example. What are you showing them? Why? What will they get out of it? "I'm going to walk you through an example of ___, which will show you _____ so that you'll begin to understand how _____" for example.
+
+[NOTE]
+.Initial version
+======
+Igpay Atinlay translator, actual version is our first Hadoop job, a program that translates plain text files into Igpay Atinlay. It’s written in Wukong, ...
+======
+
+Igpay Atinlay translator is our first Hadoop job, a program that translates plain text files into Igpay Atinlay. This is a Hadoop job stripped to its barest minimum, one that does just enough to each record to convince you it ran, but with no distractions. That makes it convenient to learn how to launch a job; how to follow its progress; and where Hadoop reports performance metrics such as run time and amount of data moved. What's more, the very fact that it's trivial makes it one of the most important examples to run. For a comparable input and output size, no regular Hadoop job can out-perform this one in practice, so it's a key reference point to keep in mind.
+
+==== Whenever you say "It's best" be sure to include a statement of why it's best.
+
+[NOTE]
+.Initial version
+======
+It’s best to begin developing jobs locally on a subset of data. Run your Wukong script directly from your terminal’s commandline: ...
+======
+
+It's best to begin developing jobs locally on a subset of data: local runs are faster and cheaper. To run the Wukong script locally, enter this at your terminal's commandline:
+
+(... a couple paragraphs later ...)
+
+NOTE: There are more reasons to begin developing jobs locally on a subset of data than speed and cost alone. Extracting a meaningful subset of tables forces you to get to know your data and its relationships. And since all the data is local, you're pushed into the good practice of first asking "what would I like to do with this data?" and only then "how shall I do so efficiently?". Beginners often want to work the other way around, but experience has taught us that it's nearly always worth the upfront investment to prepare a subset and to defer worrying about efficiency.
+
+==== Tell them what to expect before they run the job.
+
+[NOTE]
+.Initial version
+======
+First, let’s test on the same tiny little file we used at the commandline.
+
+------
+wukong launch examples/text/pig_latin.rb ./data/text/magi.txt ./output/latinized_magi
+------
+
+While the script outputs a bunch of happy robot-ese to your screen...
+======
+
+First, let's test on the same tiny little file we used at the commandline. This command does not process any data but instead instructs _Hadoop_ to process the data, and so its output will contain information on how the job is progressing.
+
+------
+wukong launch examples/text/pig_latin.rb ./data/text/magi.txt ./output/latinized_magi.txt
+------
+
+While the script outputs a bunch of happy robot-ese to your screen ...

02-hadoop_basics.asciidoc

Lines changed: 75 additions & 18 deletions
Large diffs are not rendered by default.

06-analytic_patterns-structural_operations-ordering.asciidoc

Lines changed: 29 additions & 0 deletions
@@ -292,6 +292,33 @@ NOTE: We've cheated on the theme of this chapter (pipeline-only operations) -- s
// * (how do `null`s sort?)
// * ASC / DESC: fewest strikeouts per plate appearance

+=== Numbering Records in Rank Order
+
+If you supply only the name of the table, RANK acts as a pipeline operation, introducing no extra map/reduce stage. Each split is numbered as a unit: the third line of chunk `part-00000` gets rank 2, the third line of chunk `part-00001` gets rank 2, and so on.
+
+When you give RANK a field to act on, it instead ranks records by that field's value: records that tie on the field share a rank, and adding the `DENSE` option eliminates the gaps those ties would otherwise leave in the numbering. Ranking by a field does require putting the table in total order, which brings us to the warning below.
+
+It's important to know that in current versions of Pig, the RANK operator forces parallelism to one, pushing all the data through a single reducer. If your data is unacceptably large for that, you can use the method from (REF) "Assigning a unique identifier to each line" to get a unique compound index that matches the total ordering, which might meet your needs. Otherwise, we can offer you no good workaround -- frankly, your best option may be to pay someone to fix this.
+
+------
+gift_id  gift            RANK  RANK gift_id  RANK gift DENSE
+1        partridge          1             1                1
+4a       calling birds      2             4                7
+4b       calling birds      3             4                7
+2a       turtle dove        4             2                2
+4d       calling birds      5             4                7
+5        golden rings       6             5               11
+2b       turtle dove        7             2                2
+3a       french hen         8             3                4
+3b       french hen         9             3                4
+3c       french hen        10             3                4
+4c       calling birds     11             4                7
+------
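For reference, the general shape of the statements behind a table like the one above is sketched here; the relation and field names follow the table, but this is an illustration, not the book's actual script:

------
-- sketch: the three forms of RANK
gifts         = LOAD 'gifts' AS (gift_id:chararray, gift:chararray);

by_row        = RANK gifts;                  -- pipeline numbering in record order
by_gift       = RANK gifts BY gift;          -- ties share a rank, leaving gaps
by_gift_dense = RANK gifts BY gift DENSE;    -- DENSE closes the gaps ties would leave
------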

// ==== Rank records in a group using Stitch/Over
//
//
@@ -420,3 +447,5 @@ STORE_TABLE('vals_shuffled', vals_shuffled);
-----

This follows the general plot of 'Assign a Unique ID': enable a hash function UDF; load the files so that each input split has a stable handle; and number each line within the split. The important difference here is that the hash function we generated accepts a seed that we can mix in to each record. If you supply a constant to the constructor (see the documentation) then the records will be put into an effectively random order, but the same random order each time. By supplying the string `'rand'` as the argument, the UDF will use a different seed on each run. What's nice about this approach is that although the ordering is different from run to run, it does not exhibit the anti-pattern of changing from task attempt to task attempt. The seed is generated once and then used everywhere. Rather than creating a new random number for each row, you use the hash to define an effectively random ordering, and the seed to choose which random ordering to apply.
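A minimal sketch of that recipe follows. The `DEFINE` line assumes a DataFu-style `Hasher` UDF whose second argument selects the seed; treat the exact class name and constructor arguments as placeholders to check against the chapter's actual code.

------
-- sketch only: the UDF class and its 'rand' seed argument are assumptions
DEFINE HashVal datafu.pig.hash.Hasher('murmur3-32', 'rand');

vals          = LOAD 'vals' AS (val:chararray);
vals_keyed    = FOREACH vals GENERATE val, HashVal(val) AS hash_key;
-- ordering by the hash gives an effectively random order that is stable within a run
vals_shuffled = ORDER vals_keyed BY hash_key;
------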

10-advanced_patterns.asciidoc

Lines changed: 0 additions & 28 deletions
@@ -333,34 +333,6 @@ You'll see a more elaborate version of this
// -- STORE_TABLE(normed_seasons, 'normed_seasons');


-=== Numbering Records in Rank Order
-
-If you supply only the name of the table, RANK acts as a pipeline operation, introducing no extra map/reduce stage. Each split is numbered as a unit: the third line of chunk `part-00000` gets rank 2, the third line of chunk `part-00001` gets rank 2, and so on.
-
-When you give rank a field to act on, it
-
-It's important to know that in current versions of Pig, the RANK operator sets parallelism one,
-forcing all data to a single reducer. If your data is unacceptably large for this, you can use the
-method used in (REF) "Assigning a unique identifier to each line" to get a unique compound index
-that matches the total ordering, which might meet your needs. Otherwise, we can offer you no good
-workaround -- frankly your best option may be to pay someone to fix this
-
-------
-gift            RANK  RANK gift  RANK gift DENSE
-partridge          1          1          1
-turtle dove        2          2          2
-turtle dove        3          2          2
-french hen         4          3          4
-french hen         5          3          4
-french hen         6          3          4
-calling birds      7          4          7
-calling birds      8          4          7
-calling birds      9          4          7
-calling birds     10          4          7
-K golden rings    11          5         11
-------

// -- ***************************************************************************
// --

10-event_streams.asciidoc

Lines changed: 0 additions & 1 deletion
@@ -151,7 +151,6 @@ Unless of course you are trying to test a service for resilience against an adve
flow(:mapper){ input > parse_loglines > elephant_stampede }
----

-
You must use Wukong's eventmachine bindings to make more than one simultaneous request per mapper.

=== Refs ===

11b-spatial_aggregation-points.asciidoc

Lines changed: 17 additions & 6 deletions
@@ -1,11 +1,14 @@

==== Smoothing Pointwise Data Locally (Spatial Aggregation of Points)

+Let's start by extending the group-and-aggregate pattern -- introduced in Chapter Six (REF) and ubiquitous since --

-We will start, as we always do, by applying patterns that turn Big Data into Much a Less Data. In particular,
-A great tool for visualizing a large spatial data set


+a great way to summarize a large data set, and one of the first things you'll do to Know Thy Data.
+This type of aggregation is a frontline tool of spatial analysis.
+It draws on methods you've already learned, giving us a chance to introduce some terminology and necessary details.

// * You want to "wash out" everything but the spatial variation -- even though the data was gathered for each
// * Point measurement of effect with local extent -- for example, the temperature measured at a weather station is understood to be representative of the weather for several surrounding miles.
// *
@@ -58,6 +61,13 @@ Then geonames places -- show lakes and streams (or something nature-y) vs someth
Do that again, but for a variable: airport flight volume -- researching
epidemiology


+This would also be of interest to an epidemiologist or transportation analyst who wants to know the large-scale flux of people throughout the global transportation network.
+Combining this with the weather data

// FAA flight data http://www.faa.gov/airports/planning_capacity/passenger_allcargo_stats/passenger/media/cy07_primary_np_comm.pdf

We can plot the number of air flights handled by every airport
@@ -77,8 +87,6 @@ grid_cts = FOREACH (GROUP gridded BY (bin_x, bin_y))
    SUM(n_flights) AS tot_flights;
------

-An epidemiologist or transportation analyst interested in knowing the large-scale flux of people could throughout the global transportation network
===== Pattern Recap: Spatial Aggregation of Points

* _Generic Example_ -- group on tile cell, then apply the appropriate aggregation function
@@ -89,8 +97,11 @@ An epidemiologist or transportation analyst interested in knowing the large-scal

=== Matching Points within a Given Distance (Pointwise Spatial Join)

-Now that you've learned the spatial equivalent of a `GROUP BY`, you'll probably be interested to
-learn the spatial equivalent of `COGROUP` and `JOIN`.
+Now that you've learned the spatial equivalent of a `GROUP BY` aggregation -- combining many records within a grid cell into a single summary record -- you'll probably be interested to
+learn the spatial equivalent of `COGROUP` and `JOIN` --
+collecting all records from one table that lie near records of another.

In particular, let's demonstrate how to match all points in one table with every point in another table that are less than a given fixed distance apart.
Our reindeer friends would like us to help determine what UFO pilots do while visiting Earth.
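To give the shape of the pattern before diving into the details: join the two tables on a coarse grid cell, then keep only the pairs whose exact distance qualifies. The relation names and the `GridCell` helper below are illustrative assumptions; `GeoPoint` and `GeoDistance` are described in the geospatial mechanics chapter (REF).

------
-- sketch only: table names and GridCell are illustrative
sightings = LOAD 'ufo_sightings'  AS (sighting_id:chararray, lng:double, lat:double);
landings  = LOAD 'reindeer_stops' AS (stop_id:chararray,     lng:double, lat:double);

sightings_g = FOREACH sightings GENERATE sighting_id, lng, lat, GridCell(lng, lat) AS cell;
landings_g  = FOREACH landings  GENERATE stop_id,     lng, lat, GridCell(lng, lat) AS cell;

-- candidate pairs share a grid cell; the exact distance test prunes the rest
pairs  = JOIN sightings_g BY cell, landings_g BY cell;
nearby = FILTER pairs BY
    GeoDistance(GeoPoint(sightings_g::lng, sightings_g::lat),
                GeoPoint(landings_g::lng, landings_g::lat)) <= 5000.0;
------

A pair that straddles a cell boundary would be missed by the simple equi-join above; the usual fix is to also emit each point's neighboring cells on one side of the join.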

11c-geospatial_mechanics.asciidoc

Lines changed: 24 additions & 11 deletions
@@ -60,17 +60,31 @@ We'll start with the operations that transform a shape on its own to produce a n

==== Constructing and Converting Geometry Objects

-Somewhat related are operations that bring shapes in and out of Pig's control.
+Somewhat related are operations that change the data types used to represent a shape.

-* `FromWKText(chararray)`, `FromGeoJson(chararray)` -- converts the serialized description of a shape into the corresponding geometry object. We'll cover these data formats a bit later in the chapter. Similarly, `ToWKText(geom)` and `ToGeoJson(geom)` serialize a geometry into a string
+Going from shape to coordinates-as-numbers lets you apply general-purpose manipulations.

+As a concrete example (but without going into the details), to identify patterns of periodic spacing in a set of coordinates footnote:[The methodical rows of trees in an apple orchard will appear as isolated frequency peaks oriented to the orchard plan; an old-growth forest would show little regularity and no directionality]
+you'd quite likely want to extract the coordinates of your shapes as a bag of tuples, then apply
+a generic UDF implementing the 2-D FFT (Fast Fourier Transform) algorithm to that bag.

+The files in GeoJSON, WKT, or the other geographic formats described later in this Chapter (REF) produce records directly as geometry objects.

+There are functions to construct Point, Multipoint, LineString, ... objects from coordinates you supply, and counterparts that extract a shape's coordinates as plain-old-Pig-objects.

+* `Point` / `MultiPoint` / `LineString` / `MultiLineString` / `Polygon` / `MultiPolygon` -- construct the given geometry from the coordinates you supply.
* `GeoPoint(x_coord, y_coord)` -- constructs a `Point` from the given coordinates
* `GeoEnvelope( (x_min, y_min), (x_max, y_max) )` -- constructs an `Envelope` object from the numerically lowest and numerically highest coordinates. Note that it takes two tuples as inputs, not naked coordinates.
* `GeoMultiToBag(geom)` -- splits a (multi)geometry into a bag of simple geometries. A `MultiPoint` becomes a bag of `Points`; a `Point` becomes a bag with a single `Point`, and so forth.
* `GeoBagToMulti(geom)` -- combines a bag of geometries into a single multi geometry. For instance, a bag with any mixture of `Point` and `MultiPoint` geometries becomes a single `MultiPoint` object, and similarly for (multi)lines and (multi)polygons. All the elements must have the same dimension -- no mixing (multi)points with (multi)lines, etc.
+* `FromWKText(chararray)`, `FromGeoJson(chararray)` -- converts the serialized description of a shape into the corresponding geometry object. We'll cover these data formats a bit later in the chapter. Similarly, `ToWKText(geom)` and `ToGeoJson(geom)` serialize a geometry into a string

// * (?name) GetPoints -- extract the collection of points from a geometry. Always returns a MultiPoint no matter what the input geometry.
// * (?name) GetLines -- extract the collection of lines or rings from a geometry. Returns `NULL` for a `Point`/`MultiPoint` input, and otherwise returns a MultiPoint no matter what the input geometry.
-// * Point / MultiPoint / LineString / MultiLineString / Polygon / MultiPolygon -- construct given geometry
// - ClosedLineString -- bag of points to linestring, appending the initial point if it isn't identical to the final point
// * ForceMultiness
// * AsBinary, AsText
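As a quick, hedged illustration of how the constructors and serializers above fit together (the relation and file names are placeholders):

------
-- sketch: build Point geometries from raw coordinates, then serialize them as WKT
places   = LOAD 'places' AS (name:chararray, lng:double, lat:double);
as_geoms = FOREACH places GENERATE name, GeoPoint(lng, lat) AS pt;
as_wkt   = FOREACH as_geoms GENERATE name, ToWKText(pt) AS wkt;
------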
@@ -86,22 +100,21 @@ Somewhat related are operations that bring shapes in and out of Pig's control.
* `GeoX(point)`, `GeoY(point)` -- X or Y coordinates of a point
* `GeoLength(geom)`
* `GeoLength2dSpheroid(geom)` — Calculates the 2D length of a linestring/multilinestring on an ellipsoid. This is useful if the coordinates of the geometry are in longitude/latitude and a length is desired without reprojection.
-* `GeoPerimeter(geom)` -- length measurement of a geometry's boundary
-* `GeoDistanceSphere(geom)` — Returns minimum distance in meters between two lon/lat geometries. Uses a spherical earth and radius of 6370986 meters. Faster than GeoDistanceSpheroid, but less accurate
* `GeoDistance(geom)` -- the 2-dimensional cartesian minimum distance (based on spatial ref) between two geometries in projected units.
-* `GeoMinDistance(geom)`
-* `GeoMaxDistance(geom)` -- the 2-dimensional largest distance between two geometries in projected units
+* `GeoDistanceSphere(geom)` — Returns minimum distance in meters between two lon/lat geometries. Uses a spherical earth and radius of 6370986 meters.
+// * `GeoMaxDistance(geom)` -- the 2-dimensional largest distance between two geometries in projected units
// * IsNearby -- if some part of the geometries lie within the given distance apart
// * IsNearbyFully(geom_a, geom_b, distance) -- if all parts of each geometry lies within the given distance of each other.
+// * `GeoPerimeter(geom)` -- length measurement of a geometry's boundary

There are also a set of meta-operations that report on the geometry objects representing a shape:

* `Dimension(geom)` -- This operation returns zero for Point and MultiPoint; 1 for LineString and MultiLineString; and 2 for Polygon and MultiPolygon, regardless of whether those shapes exist in a 2-D or 3-D space
* `CoordDim(geom)` -- the number of axes in the coordinate system being used: 2 for X-Y geometries, 3 for X-Y-Z geometries, and so on. Points, lines and polygons within a common coordinate system will all have the same value for `CoordDim`
* `GeometryType(geom)` -- string representing the geometry type: `'Point'`, `'LineString'`, ..., `'MultiPolygon'`.
-* `IsGeomEmpty(geom)` -- 1 if the geometry contains no actual points.
-* `IsLineClosed(line)` -- 1 if the given `LineString`'s end point meets its start point.
-* `IsSimple` -- 1 if the geometry has no anomalous geometric aspects, such intersecting or being tangent to itself. A multipoint is 'simple' if none of its points coincide.
+* `IsGeoEmpty(geom)` -- 1 if the geometry contains no actual points.
+* `IsGeoClosed(line)` -- 1 if the given `LineString`'s end point meets its start point.
+* `IsGeoSimple` -- 1 if the geometry has no anomalous geometric aspects, such as intersecting or being tangent to itself. A multipoint is 'simple' if none of its points coincide.
* `IsLineRing` -- 1 if the given `LineString` is a ring -- that is, closed and simple.

* `NumGeometries(geom_collection)`
@@ -155,7 +168,7 @@ The geospatial toolbox has a set of precisely specified spatial relationships. T
* `Contains(geom_a, geom_b)` -- 1 if `geom_a` completely contains `geom_b`: that is, the shapes' interiors intersect, and no part of `geom_b` lies in the exterior of `geom_a`. If two shapes are equal, then it is true that each contains the other. `Contains(A, B)` is exactly equivalent to `Within(B, A)`.
// - `ContainsProperly(geom_a, geom_b)` -- 1 if : that is, the shapes' interiors intersect, and no part of `geom_b` intersects the exterior _or boundary_ of `geom_a`. The result of `Contains(A, A)` is always 1 and the result of `ContainsProperly(A,A) is always 0.
* `Within(geom_a, geom_b)` -- 1 if `geom_a` is completely contained by `geom_b`: that is, the shapes' interiors intersect, and no part of `geom_a` lies in the exterior of `geom_b`. If two shapes are equal, then it is true that each is within the other.
-* `Covers(geom_a, geom_b)` -- 1 if no point in `geom_b` is outside `geom_a`. `CoveredBy(geom_a, geom_b)` is sugar for `Covers(geom_b, geom_a)`.
+* `Covers(geom_a, geom_b)` -- 1 if no point in `geom_b` is outside `geom_a`. `CoveredBy(geom_a, geom_b)` is sugar for `Covers(geom_b, geom_a)`. (TODO: verify: A polygon covers its boundary but does not contain its boundary.)
* `Crosses(geom_a, geom_b)` -- 1 if the shapes cross: their geometries have some, but not all, interior points in common; and the dimension of the intersection is one less than the higher-dimension of the two shapes. That's a mouthful, so let's just look at the cases in turn:
- A MultiPoint crosses a (multi)line or (multi)polygon as long as at least one of its points lies in the other shape's interior, and at least one of its points lies in the other shape's exterior. Points along the border of the polygon(s) or the endpoints of the line(s) don't matter.
- A Line/MultiLine crosses a Polygon/MultiPolygon only when part of some line lies within the polygon(s)' interior and part of some line lies within the polygon(s)' exterior. Points along the border of a polygon or the endpoints of a line don't matter.
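These predicates typically show up inside a FILTER. A minimal sketch, using `FromWKText` and `Contains` from the lists above (relation and file names are placeholders, and the cross product is only sensible when the table of regions is small):

------
-- sketch: keep the points that fall inside any region of interest
points  = LOAD 'points'  AS (pt_id:chararray,     pt_wkt:chararray);
regions = LOAD 'regions' AS (region_id:chararray, region_wkt:chararray);

pts_g   = FOREACH points  GENERATE pt_id,     FromWKText(pt_wkt)     AS pt;
rgns_g  = FOREACH regions GENERATE region_id, FromWKText(region_wkt) AS shape;

pairs   = CROSS pts_g, rgns_g;
inside  = FILTER pairs BY Contains(rgns_g::shape, pts_g::pt) == 1;
------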
