00-outlines.asciidoc: 10 additions & 0 deletions

@@ -1,4 +1,14 @@
+9. Statistics
+10. Event streams -- has some good examples, no real flow. The topic I'd be most excited to get in the book is the geo-ip matching, which demonstrates a range join.
+12, 21, 22, 23. Hadoop internals and tuning. As you can see just from the number of files involved, this is particularly disorganized. If you and I worked out a structure of what should be there, I can organize the spare parts around it.
+13. Data munging. This is some of the earliest material and thus some of the messiest. I don't believe this is worth reworking.
+14. Organizing data -- the only real material here is a rundown of data formats. Rough.
+15. Filesystem mojo and `cat` herding -- runs down the command-line tools: wc, cut, etc. This is actually in decent shape, but should become an appendix, I think.
+18. Native Java API -- I'd like to have this chapter in there with either the content being the single sentence "Don't", or that sentence plus one prose paragraph saying you should write Hive or Pig UDFs instead.
+19. Advanced Pig -- the material that's there, on Pig config variables and two of the fancy joins, is not too messy. I'd like to at least tell readers about the replicated join, and probably even move it into the earlier chapters. The most we should do here would be to also describe an inline Python UDF and a Java UDF, but there's no material for that (though I do have code examples of UDFs).
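While we sort out item 10, here's a minimal sketch of what the geo-ip range join computes. The relation and field names are made up for illustration, and this naive CROSS-then-FILTER form is exactly the inefficiency a real range-join technique avoids:

[source,pig]
----
-- Naive picture of a range join: pair every log line with every ip block,
-- then keep the pairs whose numeric ip falls inside the block's range.
-- (logs, ip_blocks, and all field names are illustrative, not from the chapter.)
logs      = LOAD 'logs.tsv'      AS (ip_num:long, url:chararray);
ip_blocks = LOAD 'ip_blocks.tsv' AS (ip_lo:long, ip_hi:long, geo:chararray);
pairs     = CROSS logs, ip_blocks;
located   = FILTER pairs BY (logs::ip_num >= ip_blocks::ip_lo) AND (logs::ip_num <= ip_blocks::ip_hi);
----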
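And for item 19, a sketch of the replicated join I'd like readers to meet earlier -- relation and field names are illustrative:

[source,pig]
----
-- Replicated (map-side) join: Pig ships the last-listed relation whole to
-- every mapper, so the join needs no reduce phase. The small relation must
-- fit in mapper memory.
big    = LOAD 'player_seasons.tsv' AS (player_id:chararray, team_id:chararray, year:int);
small  = LOAD 'teams.tsv'          AS (team_id:chararray, team_name:chararray);
joined = JOIN big BY team_id, small BY team_id USING 'replicated';
----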
05-map_only_patterns.asciidoc: 10 additions & 10 deletions

@@ -1,9 +1,9 @@
-== Analytic Patterns part 1: Pipeline Operations
+== Analytic Patterns part 1: Map-only Operations

-This chapter focuses exclusively on what we'll call 'pipelineable operations'.
-A pipelineable operation is one that can handle each record in isolation, like the translator chimps from Chimpanzee & Elephant's first job. That property makes those operations trivially parallelizable: they require no reduce phase of their own.
+This chapter focuses exclusively on what we'll call 'map-only operations'.
+A map-only operation is one that can handle each record in isolation, like the translator chimps from Chimpanzee & Elephant's first job. That property makes those operations trivially parallelizable: they require no reduce phase of their own.

-When a script has only pipelineable operations, they give rise to one mapper-only job which executes the composed pipeline stages. When pipelineable operations are combined with the structural operations you'll meet in the next chapter, they are composed with the stages of the mapper or reducer (depending on whether they come before or after the structural operation).
+When a script has only map-only operations, they give rise to one mapper-only job which executes the composed pipeline stages. When map-only operations are combined with the structural operations you'll meet in the next chapter, they are composed with the stages of the mapper or reducer (depending on whether they come before or after the structural operation).

 All of these are listed first and together for two reasons. One, they are largely fundamental; it's hard to get much done without `FILTER` or `FOREACH`. Two, the way you reason about the performance impact of these operations is largely the same. Since these operations are trivially parallelizable, they scale efficiently and the computation cost rarely impedes throughput. And when pipelined, their performance cost can be summarized as "kids eat free with purchase of adult meal". For datasets of any material size, it's very rare that the cost of preliminary or follow-on processing rivals the cost of the reduce phase. Finally, since these operations handle records in isolation, their memory impact is modest. So learn to think of these together.
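To make the compilation story concrete, here's a minimal sketch of a script built only from map-only operations -- the file and field names are made up for illustration. Pig runs the whole thing as a single map-only job:

[source,pig]
----
-- LOAD, FILTER, FOREACH, STORE: each record is handled in isolation,
-- so no reduce phase is ever needed.
player_seasons = LOAD 'player_seasons.tsv' AS (player_id:chararray, year:int, hits:int, at_bats:int);
qualified      = FILTER player_seasons BY at_bats >= 450;
batting_avg    = FOREACH qualified GENERATE player_id, year, (float)hits / at_bats AS avg;
STORE batting_avg INTO 'batting_avg_dir';
----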
@@ -70,7 +70,7 @@ Blocks like the following will show up after each of the patterns or groups of p
 - Programmers take note: `AND`, `OR` -- not `&&`, `||`.
 * _Output Count_ -- (_How many records in the output: fewer, same, more, explosively more?_) Zero to 100% of the input record count. Data size will decrease accordingly.
 * _Records_ -- (_A sketch of what the records coming out of this operation look like_) Identical to input.
-* _Data Flow_ -- (_The Hadoop jobs this operation gives rise to. In this chapter, all the lines will look like this one; in the next chapters that will change_) Pipelinable: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
+* _Data Flow_ -- (_The Hadoop jobs this operation gives rise to. In this chapter, all the lines will look like this one; in the next chapters that will change_) Map-Only: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
 * _Exercises for You_ -- (_A mission to carry forward, if you choose. Don't go looking for an answer section -- we haven't done any of them. In many cases you'll be the first to find the answer._) Play around with `null`s and the conditional operators until you have a good sense of their quirks.
 * _See Also_ -- (_Besides the patterns in its section of the book, what other topics might apply if you're considering this one? Sometimes this is another section in the book, sometimes it's a pointer elsewhere_) The Distinct operations, some Set operations, and some Joins are also used to eliminate records according to some criteria. See especially the Semi-Join and Anti-Join (REF), which select or reject matches against a large list of keys.
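A quick sketch of those two gotchas -- the `AND`/`OR` spelling and `null` behavior -- with illustrative relation and field names:

[source,pig]
----
-- Pig wants AND / OR, never && / ||
heavy = FILTER players BY (weight_lb > 220) AND (height_in > 74);
-- a comparison against null is neither true nor false, so a null weight_lb
-- fails BOTH of these filters -- null records simply vanish from each output
kept    = FILTER players BY weight_lb > 200;
dropped = FILTER players BY NOT (weight_lb > 200);
----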
@@ -123,7 +123,7 @@ NOTE: Sadly, the Nobel Prize-winning physicists Gerard 't Hooft, Louis-Victor Pi
 - You're far better off learning one extra thing to do with a regular expression than most of the other string conditional functions Pig offers.
 - ... and enough other Importants to Know that we made a sidebar of them (REF).
 * _Records_ -- You can use this in a filter clause, but also anywhere else an expression is permitted, like the preceding snippet.
-* _Data Flow_ -- Pipelinable: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
+* _Data Flow_ -- Map-Only: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
 * _Exercises for You_ -- Follow the http://regexp.info/tutorial.html[regexp.info tutorial], but _only up to the part on Grouping & Capturing_. The rest you are far better off picking up once you find you need it.
 * _See Also_ -- The Pig `REGEX_EXTRACT` and http://pig.apache.org/docs/r0.12.0/func.html#replace[`REPLACE`] functions. Java's http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#sum[Regular Expression] documentation for details on its peccadilloes (but not for an education about regular expressions).
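One small example of `MATCHES` in a filter, with made-up relation and field names -- note that Pig's `MATCHES` must consume the whole string, hence the trailing `.*`:

[source,pig]
----
-- keep people whose first name starts with a vowel
vowel_folk = FILTER people BY name_first MATCHES '[AEIOU].*';
----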
@@ -152,7 +152,7 @@ The general case is handled by using a join, as described in the next chapter (R
 * _Hello, SQL Users_ -- This isn't anywhere near as powerful as SQL's `IN` expression. Most importantly, you can't supply another table as the list.
 * _Important to Know_ -- A regular expression alternation is often the right choice instead.
 * _Output Count_ -- As many records as the cardinality of its key, i.e. the number of distinct values. Data size should decrease greatly.
-* _Data Flow_ -- Pipelinable: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
+* _Data Flow_ -- Map-Only: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
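The alternation trick mentioned above, sketched with illustrative team ids:

[source,pig]
----
-- a small fixed membership list as a regex alternation, in place of SQL's IN
al_east = FILTER teams BY team_id MATCHES 'BOS|NYA|BAL|TBA|TOR';
----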

 === Project Only Chosen Columns by Name
@@ -194,7 +194,7 @@ The first projection puts the `home_team_id` into the team slot, renaming it `te
 * _Important to Know_ -- As you can see, we take a lot of care visually aligning subexpressions within the code snippets. That's not because we've tidied up the house for students coming over -- this is what the code we write, and the code our teammates expect us to write, looks like.
 * _Output Count_ -- Exactly the same as the input.
 * _Records_ -- However you define them to be.
-* _Data Flow_ -- Pipelinable: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
+* _Data Flow_ -- Map-Only: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
 * _See Also_ -- "Assembling Literals with Complex Type" (REF)

 ==== Extracting a Random Sample of Records
@@ -219,7 +219,7 @@ Experienced software developers will reach for a "seeding" function -- such as R
 - The DataFu package has UDFs for sampling with replacement and other advanced features.
 * _Output Count_ -- Determined by the sampling fraction. As a rule of thumb, variances of things are square-root-ish; expect the size of a 10% sample to be in the 7%-13% range.
 * _Records_ -- Identical to the input.
-* _Data Flow_ -- Pipelinable: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
+* _Data Flow_ -- Map-Only: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
 * _Exercises for You_ -- Modify Pig's SAMPLE function to accept a seed parameter, and submit that patch back to the open-source project. This is a bit harder to do than it seems: sampling is key to efficient sorting, and so the code to sample data is intertwingled with a lot of core functionality.
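The operation under discussion, for reference -- `SAMPLE` is real Pig; the relation name and fraction are ours:

[source,pig]
----
-- roughly 10% of records; no seed parameter, so each run draws a different sample
some_seasons = SAMPLE player_seasons 0.10;
----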
 ==== Extracting a Consistent Sample of Records by Key

@@ -242,7 +242,7 @@ We called this a terrible hash function, but it does fit the bill. When applied
 - If you'll be spending a bunch of time with a data set, using any kind of random sample to prepare your development sample might be a stupid idea. You'll notice that Red Sox players show up a lot of times in our examples -- that's because our development samples are "seasons by Red Sox players" and "seasons from 2000-2010", which lets us make good friends with the data.
 * _Output Count_ -- Determined by the sampling fraction. As a rule of thumb, variances of things are square-root-ish; expect the size of a 10% sample to be in the 7%-13% range.
 * _Records_ -- Identical to the input.
-* _Data Flow_ -- Pipelinable: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
+* _Data Flow_ -- Map-Only: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
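A crude stand-in for the chapter's hash trick, just to show what "consistent by key" means -- this filter isn't the book's function, only an illustration that keeps the same players in the sample on every run:

[source,pig]
----
-- keep every season whose player_id ends in 0 or 1: roughly a 20% sample if
-- trailing characters are uniform, and a given key is always in or always out
consistent_sample = FILTER player_seasons BY player_id MATCHES '.*[01]';
----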

 ==== Sampling Carelessly by Only Loading Some `part-` Files