==== Tell readers what the point of this is before you dive into the example. What are you showing them? Why? What will they get out of it? "I'm going to walk you through an example of ___, which will show you _____ so that you'll begin to understand how _____" for example.
[NOTE]
.Initial version
======
Igpay Atinlay translator, actual version is our first Hadoop job, a program that translates plain text files into Igpay Atinlay. It’s written in Wukong, ...
======
The Igpay Atinlay translator is our first Hadoop job, a program that translates plain text files into Igpay Atinlay. It is a Hadoop job stripped to its barest minimum: one that does just enough to each record to convince you it was processed, but with no distractions. That makes it convenient for learning how to launch a job, how to follow its progress, and where Hadoop reports performance metrics such as run time and amount of data moved. What's more, the very fact that it's trivial makes it one of the most important examples to run: for comparable input and output sizes, no regular Hadoop job can out-perform this one in practice, so it's a key reference point to carry in mind.
==== Whenever you say "It's best" be sure to include a statement of why it's best.

[NOTE]
.Initial version
======
It’s best to begin developing jobs locally on a subset of data. Run your Wukong script directly from your terminal’s commandline: ...
======
It's best to begin developing jobs locally on a subset of data: they are faster and cheaper to run. To run the Wukong script locally, enter this into your terminal's commandline:
(... a couple paragraphs later ...)
NOTE: There are even more reasons why it's best to begin developing jobs locally on a subset of data than just "faster and cheaper". Extracting a meaningful subset of tables also forces you to get to know your data and its relationships. And since all the data is local, you're forced into the good practice of first addressing "what would I like to do with this data?" and only then considering "how shall I do so efficiently?". Beginners often want to believe the opposite, but experience has taught us that it's nearly always worth the upfront investment to prepare a subset, and not to think about efficiency from the beginning.
==== Tell them what to expect before they run the job.

[NOTE]
.Initial version
======
First, let’s test on the same tiny little file we used at the commandline.
While the script outputs a bunch of happy robot-ese to your screen...
======
First, let's test on the same tiny little file we used at the commandline. This command does not process any data but instead instructs _Hadoop_ to process the data, and so its output will contain information on how the job is progressing.
06-analytic_patterns-structural_operations-ordering.asciidoc: 29 additions & 0 deletions
@@ -292,6 +292,33 @@ NOTE: We've cheated on the theme of this chapter (pipeline-only operations) -- s
// * (how do `null`s sort?)
// * ASC / DESC: fewest strikeouts per plate appearance
=== Numbering Records in Rank Order
If you supply only the name of the table, RANK acts as a pipeline operation, introducing no extra map/reduce stage. Each split is numbered as a unit: the third line of chunk `part-00000` gets rank 2, the third line of chunk `part-00001` gets rank 2, and so on.
When you give RANK a field to act on, it instead ranks each record according to that field's sorted order over the full dataset, which calls for a total ordering.

It's important to know that in current versions of Pig, the RANK operator sets parallelism to one, forcing all data to a single reducer. If your data is unacceptably large for this, you can use the method used in (REF) "Assigning a unique identifier to each line" to get a unique compound index that matches the total ordering, which might meet your needs. Otherwise, we can offer you no good workaround -- frankly, your best option may be to pay someone to fix this.
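To make the two behaviors concrete, here is a minimal Pig sketch of both forms (the `players` relation, its load path, and its fields are hypothetical stand-ins; any relation with a sortable field would do):

[source,pig]
----
players  = LOAD 'players.tsv' AS (player_id:chararray, team:chararray, hits:int);

-- Pipeline form: no field given, so each record simply gains a rank column.
numbered = RANK players;

-- Keyed form: ranks follow the sort order of `hits`, which implies a total
-- ordering -- and, in current Pig versions, a single reducer.
by_hits  = RANK players BY hits DESC;
----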
This follows the general plot of 'Assign a Unique ID': enable a hash function UDF; load the files so that each input split has a stable handle; and number each line within the split. The important difference here is that the hash function we generated accepts a seed that we can mix in to each record. If you supply a constant to the constructor (see the documentation) then the records will be put into an effectively random order, but the same random order each time. By supplying the string `'rand'` as the argument, the UDF will use a different seed on each run. What's nice about this approach is that although the ordering is different from run to run, it does not exhibit the anti-pattern of changing from task attempt to task attempt. The seed is generated once and then used everywhere. Rather than creating a new random number for each row, you use the hash to define an effectively random ordering, and the seed to choose which random ordering to apply.
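Here is a rough sketch of that recipe; the `Hasher` UDF path and its registration are stand-ins for whatever seeded hash UDF you have on hand, and only the shape of the script matters:

[source,pig]
----
-- Hypothetical registration; substitute the jar and class of your hash UDF.
REGISTER 'my_udfs.jar';
DEFINE SeededHash my.udfs.Hasher('rand');  -- 'rand': pick a fresh seed once per run

lines    = LOAD 'events.tsv' AS (line:chararray);

-- Mix the per-run seed into each record by hashing the record itself ...
keyed    = FOREACH lines GENERATE SeededHash(line) AS sort_key, line;

-- ... then sort on the hash: an effectively random order, stable across task attempts.
ordered  = ORDER keyed BY sort_key;
shuffled = FOREACH ordered GENERATE line;
----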
11b-spatial_aggregation-points.asciidoc: 17 additions & 6 deletions
@@ -1,11 +1,14 @@
==== Smoothing Pointwise Data Locally (Spatial Aggregation of Points)
[NOTE]
.Initial version
======
We will start, as we always do, by applying patterns that turn Big Data into Much a Less Data. In particular, a great tool for visualizing a large spatial data set ...
======

Let's start by extending the group-and-aggregate pattern -- introduced in Chapter Six (REF) and ubiquitous since -- a great way to summarize a large data set, and one of the first things you’ll do to Know Thy Data. This type of aggregation is a frontline tool of spatial analysis. It draws on methods you’ve already learned, giving us a chance to introduce some terminology and necessary details.
// * You want to "wash out" everything but the spatial variation -- even though the data was gathered for each
// * Point measurement of effect with local extent -- for example, the temperature measured at a weather station is understood to be representative of the weather for several surrounding miles.
// *
@@ -58,6 +61,13 @@ Then geonames places -- show lakes and streams (or something nature-y) vs someth
Do that again, but for a variable: airport flight volume -- researching epidemiology.

This would also be of interest to an epidemiologist or transportation analyst interested in knowing the large-scale flux of people throughout the global transportation network.
Combining this with the weather data
// FAA flight data http://www.faa.gov/airports/planning_capacity/passenger_allcargo_stats/passenger/media/cy07_primary_np_comm.pdf
We can plot the number of air flights handled by every airport
===== Pattern Recap: Spatial Aggregation of Points
* _Generic Example_ -- group on tile cell, then apply the appropriate aggregation function
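Rendered as a minimal Pig sketch (the `sightings` relation and the `QuadkeyFor` tile UDF are hypothetical stand-ins for your point table and whichever tile function you use; zoom level 9 is arbitrary):

[source,pig]
----
sightings = LOAD 'sightings.tsv' AS (lng:double, lat:double, sighted_at:chararray);

-- Assign each point to its grid cell (QuadkeyFor stands in for your tile UDF).
tiled     = FOREACH sightings GENERATE QuadkeyFor(lng, lat, 9) AS tile_id, lng, lat;

-- Group on the tile cell, then apply the appropriate aggregation function --
-- here, just a count of points per cell.
by_tile   = GROUP tiled BY tile_id;
tile_cts  = FOREACH by_tile GENERATE group AS tile_id, COUNT_STAR(tiled) AS n_points;
----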
@@ -89,8 +97,11 @@ An epidemiologist or transportation analyst interested in knowing the large-scal
=== Matching Points within a Given Distance (Pointwise Spatial Join)
Now that you've learned the spatial equivalent of a `GROUP BY` aggregation -- combining many records within a grid cell into a single summary record -- you'll probably be interested to learn the spatial equivalent of `COGROUP` and `JOIN` -- collecting all records in one table that lie near records in another.
In particular, let's demonstrate how to match all points in one table with every point in another table that are less than a given fixed distance apart.
Our reindeer friends would like us to help determine what UFO pilots do while visiting Earth.
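Here is a minimal sketch of the pattern, assuming hypothetical `sightings` and `airports` tables, a stand-in `CellOf` tile UDF, and the `GeoPoint` and `GeoDistanceSphere` helpers described later in this chapter (we assume the distance function accepts the two geometries; the 10 km threshold is illustrative):

[source,pig]
----
sightings = LOAD 'ufo_sightings.tsv' AS (sighted_at:chararray, s_lng:double, s_lat:double);
airports  = LOAD 'airports.tsv'      AS (airport_id:chararray, a_lng:double, a_lat:double);

-- Coarse pass: put both tables on the same grid so candidate pairs share a key.
-- (CellOf stands in for your tile UDF; checking neighboring cells is omitted for brevity.)
s_tiled = FOREACH sightings GENERATE CellOf(s_lng, s_lat) AS cell, sighted_at, s_lng, s_lat;
a_tiled = FOREACH airports  GENERATE CellOf(a_lng, a_lat) AS cell, airport_id, a_lng, a_lat;

-- Fine pass: join on the shared cell, then keep only pairs within the distance threshold.
pairs   = JOIN s_tiled BY cell, a_tiled BY cell;
nearby  = FILTER pairs BY GeoDistanceSphere(GeoPoint(s_lng, s_lat), GeoPoint(a_lng, a_lat)) < 10000.0;
----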
11c-geospatial_mechanics.asciidoc: 24 additions & 11 deletions
@@ -60,17 +60,31 @@ We'll start with the operations that transform a shape on its own to produce a n
==== Constructing and Converting Geometry Objects
Somewhat related are operations that change the data types used to represent a shape.

Going from shape to coordinates-as-numbers lets you apply general-purpose manipulations.
As a concrete example (but without going into the details), to identify patterns of periodic spacing in a set of coordinates footnote:[The methodical rows of trees in an apple orchard will appear as isolated frequency peaks oriented to the orchard plan; an old-growth forest would show little regularity and no directionality], you'd quite likely want to extract the coordinates of your shapes as a bag of tuples and apply a generic UDF implementing the 2-D FFT (Fast Fourier Transform) algorithm.

The files in GeoJSON, WKT, or the other geographic formats described later in this Chapter (REF) produce records directly as geometry objects.
There are functions to construct Point, Multipoint, LineString, ... objects from coordinates you supply, and counterparts that extract a shape's coordinates as plain-old-Pig-objects; a brief sketch of these conversions follows the list below.
* `GeoPoint(x_coord, y_coord)` -- constructs a `Point` from the given coordinates
* `GeoEnvelope( (x_min, y_min), (x_max, y_max) )` -- constructs an `Envelope` object from the numerically lowest and numerically highest coordinates. Note that it takes two tuples as inputs, not naked coordinates.
* `GeoMultiToBag(geom)` -- splits a (multi)geometry into a bag of simple geometries. A `MultiPoint` becomes a bag of `Points`; a `Point` becomes a bag with a single `Point`, and so forth.
* `GeoBagToMulti(geom)` -- combines a bag of geometries into a single multi geometry. For instance, a bag with any mixture of `Point` and `MultiPoint` geometries becomes a single `MultiPoint` object, and similarly for (multi)lines and (multi)polygons. All the elements must have the same dimension -- no mixing (multi)points with (multi)lines, etc.
* `FromWKText(chararray)`, `FromGeoJson(chararray)` -- converts the serialized description of a shape into the corresponding geometry object. We'll cover these data formats a bit later in the chapter. Similarly, `ToWKText(geom)` and `ToGeoJson(geom)` serialize a geometry into a string
// * (?name) GetPoints -- extract the collection of points from a geometry. Always returns a MultiPoint no matter what the input geometry.
// * (?name) GetLines -- extract the collection of lines or rings from a geometry. Returns `NULL` for a `Point`/`MultiPoint` input, and otherwise returns a MultiPoint no matter what the input geometry.
// - ClosedLineString -- bag of points to linestring, appending the initial point if it isn't identical to the final point
// * ForceMultiness
// * AsBinary, AsText
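A minimal sketch of those conversions (the load path, field names, and the REGISTER/DEFINE plumbing for the geometry UDFs are assumptions; only the round trip between serialized text and geometry objects matters here):

[source,pig]
----
-- Hypothetical input: a name and a WKT string per line, e.g. 'POINT(-97.74 30.27)'.
raw_places = LOAD 'places_wkt.tsv' AS (name:chararray, wkt:chararray);

-- Serialized text in: WKT string to geometry object.
places     = FOREACH raw_places GENERATE name, FromWKText(wkt) AS geom;

-- Break any (multi)geometry into a bag of simple geometries for per-point work.
as_points  = FOREACH places GENERATE name, GeoMultiToBag(geom) AS pts;

-- Serialized text out: geometry object to GeoJSON, for whatever tool sits downstream.
as_json    = FOREACH places GENERATE name, ToGeoJson(geom);
----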
@@ -86,22 +100,21 @@ Somewhat related are operations that bring shapes in and out of Pig's control.
* `GeoX(point)`, `GeoY(point)` -- X or Y coordinates of a point
* `GeoLength(geom)`
* `GeoLength2dSpheroid(geom)` — Calculates the 2D length of a linestring/multilinestring on an ellipsoid. This is useful if the coordinates of the geometry are in longitude/latitude and a length is desired without reprojection.
* `GeoDistance(geom)` -- the 2-dimensional cartesian minimum distance (based on spatial ref) between two geometries in projected units.
* `GeoDistanceSphere(geom)` — Returns minimum distance in meters between two lon/lat geometries. Uses a spherical earth and radius of 6370986 meters.
// * `GeoMaxDistance(geom)` -- the 2-dimensional largest distance between two geometries in projected units
// * IsNearby -- if some part of the geometries lie within the given distance apart
// * IsNearbyFully(geom_a, geom_b, distance) -- if all parts of each geometry lies within the given distance of each other.
// * `GeoPerimeter(geom)` -- length measurement of a geometry's boundary
There are also a set of meta-operations that report on the geometry objects representing a shape:
* `Dimension(geom)` -- This operation returns zero for Point and MultiPoint; 1 for LineString and MultiLineString; and 2 for Polygon and MultiPolygon, regardless of whether those shapes exist in a 2-D or 3-D space
* `CoordDim(geom)` -- the number of axes in the coordinate system being used: 2 for X-Y geometries, 3 for X-Y-Z geometries, and so on. Points, lines and polygons within a common coordinate system will all have the same value for `CoordDim`
* `IsGeoEmpty(geom)` -- 1 if the geometry contains no actual points.
* `IsGeoClosed(line)` -- 1 if the given `LineString`'s end point meets its start point.
* `IsGeoSimple` -- 1 if the geometry has no anomalous geometric aspects, such as intersecting or being tangent to itself. A multipoint is 'simple' if none of its points coincide.
* `IsLineRing` -- 1 if the given `LineString` is a ring -- that is, closed and simple.
* `NumGeometries(geom_collection)`
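These report functions drop naturally into `FILTER` expressions. A minimal sketch, assuming a hypothetical `shapes` relation and that the predicates return 1/0 as described above:

[source,pig]
----
shapes = LOAD 'shapes_wkt.tsv' AS (name:chararray, wkt:chararray);
geoms  = FOREACH shapes GENERATE name, FromWKText(wkt) AS geom;

-- Keep only non-empty, well-behaved geometries before doing real work on them.
usable = FILTER geoms BY (IsGeoEmpty(geom) == 0) AND (IsGeoSimple(geom) == 1);
----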
@@ -155,7 +168,7 @@ The geospatial toolbox has a set of precisely specified spatial relationships. T
* `Contains(geom_a, geom_b)` -- 1 if `geom_a` completely contains `geom_b`: that is, the shapes' interiors intersect, and no part of `geom_b` lies in the exterior of `geom_a`. If two shapes are equal, then it is true that each contains the other. `Contains(A, B)` is exactly equivalent to `Within(B, A)`.
// - `ContainsProperly(geom_a, geom_b)` -- 1 if : that is, the shapes' interiors intersect, and no part of `geom_b` intersects the exterior _or boundary_ of `geom_a`. The result of `Contains(A, A)` is always 1 and the result of `ContainsProperly(A,A) is always 0.
* `Within(geom_a, geom_b)` -- 1 if `geom_a` is completely contained by `geom_b`: that is, the shapes' interiors intersect, and no part of `geom_a` lies in the exterior of `geom_b`. If two shapes are equal, then it is true that each is within the other.
* `Covers(geom_a, geom_b)` -- 1 if no point in `geom_b` is outside `geom_a`. `CoveredBy(geom_a, geom_b)` is sugar for `Covers(geom_b, geom_a)`. (TODO: verify: A polygon covers its boundary but does not contain its boundary.)
* `Crosses(geom_a, geom_b)` -- 1 if the shapes cross: their geometries have some, but not all, interior points in common; and the dimension of the intersection is one less than the higher-dimension of the two shapes. That's a mouthful, so let's just look at the cases in turn:
- A MultiPoint crosses a (multi)line or (multi)polygon as long as at least one of its points lies in the other shape's interior, and at least one of its points lies in the other shape's exterior. Points along the border of the polygon(s) or the endpoints of the line(s) don't matter.
- A Line/MultiLine crosses a Polygon/MultiPolygon only when part of some line lies within the polygon(s)' interior and part of some line lies within the polygon(s)' exterior. Points along the border of a polygon or the endpoints of a line don't matter.