
Commit 182d895

Author: Philip (flip) Kromer
Message: organizing and attic-ing material to determine final outline
1 parent 2852f5b

File tree: 77 files changed, +707 −710 lines


00-outlines.asciidoc

Lines changed: 10 additions & 0 deletions
@@ -1,4 +1,14 @@
 
+9. Statistics
+10. Event streams -- has some good examples, no real flow. The topic I'd be most excited to get in the book is the geo-IP matching, which demonstrates a range join.
+12, 21, 22, 23. Hadoop internals and tuning. As you can see just from the number of files involved, this is particularly disorganized. If you and I worked out a structure of what should be there, I can organize the spare parts around it.
+13. Data munging. This is some of the earliest material and thus some of the messiest. I don't believe this is worth reworking.
+14. Organizing data -- the only real material here is a rundown of data formats. Rough.
+15. Filesystem mojo and `cat` herding -- runs down the command-line tools: wc, cut, etc. This is actually in decent shape, but I think it should become an appendix.
+18. Native Java API -- I'd like to have this chapter in there with the content being either the single sentence "Don't", or that sentence plus one prose paragraph saying you should write Hive or Pig UDFs instead.
+19. Advanced Pig -- the material that's there, on Pig config variables and two of the fancy joins, is not too messy. I'd like to at least tell readers about the replicated join, and probably even move it into the earlier chapters. The most we should do here would be to also describe an inline Python UDF and a Java UDF, but there's no material for that (though I do have code examples of UDFs).
+
+
 10. **Event log**
 - geo IP via range query
 - sessionizing, user paths
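The range join mentioned for the event-stream chapter above can be sketched outside Pig. Here is an illustrative Python sketch of geo-IP matching with made-up block data (the block table, names, and values are all assumptions, not material from the book): each IP is matched to the block whose [start, end] range contains it, via binary search on the sorted block starts.

```python
import bisect

# Hypothetical IP-block table, sorted by range start (toy values, not real geo data).
blocks = [(0, 9, "net-A"), (10, 19, "net-B"), (20, 29, "net-C")]
starts = [b[0] for b in blocks]

def lookup(ip):
    """Find the block whose [start, end] range contains ip, or None."""
    i = bisect.bisect_right(starts, ip) - 1  # rightmost block starting at or before ip
    if i >= 0 and blocks[i][0] <= ip <= blocks[i][1]:
        return blocks[i][2]
    return None

assert lookup(14) == "net-B"   # falls inside the second block's range
assert lookup(35) is None      # past the last block's end: no match
```

In a Hadoop setting the same idea is expressed as a join between the event stream and the block table on the range condition, rather than a per-record binary search.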
3 files renamed without changes.

05-analytic_patterns-pipeline_operations.asciidoc renamed to 05-map_only_patterns.asciidoc

Lines changed: 10 additions & 10 deletions
@@ -1,9 +1,9 @@
-== Analytic Patterns part 1: Pipeline Operations
+== Analytic Patterns part 1: Map-only Operations
 
-This chapter focuses exclusively on what we'll call 'pipelineable operations'.
-A pipelineable operation is one that can handle each record in isolation, like the translator chimps from Chimpanzee & Elephant's first job. That property makes those operations trivially parallelizable: they require no reduce phase of their own.
+This chapter focuses exclusively on what we'll call 'map-only operations'.
+A map-only operation is one that can handle each record in isolation, like the translator chimps from Chimpanzee & Elephant's first job. That property makes those operations trivially parallelizable: they require no reduce phase of their own.
 
-When a script has only pipelineable operations, they give rise to one mapper-only job which executes the composed pipeline stages. When pipelineable operations are combined with the structural operations you'll meet in the next chapter, they are composed with the stages of the mapper or reducer (depending on whether they come before or after the structural operation).
+When a script has only map-only operations, they give rise to one mapper-only job which executes the composed pipeline stages. When map-only operations are combined with the structural operations you'll meet in the next chapter, they are composed with the stages of the mapper or reducer (depending on whether they come before or after the structural operation).
 
 All of these are listed first and together for two reasons. One, they are largely fundamental; it's hard to get much done without `FILTER` or `FOREACH`. Two, the way you reason about the performance impact of these operations is largely the same. Since these operations are trivially parallelizable, they scale efficiently and the computation cost rarely impedes throughput. And when pipelined, their performance cost can be summarized as "kids eat free with purchase of adult meal". For datasets of any material size, it's very rare that the cost of preliminary or follow-on processing rivals the cost of the reduce phase. Finally, since these operations handle records in isolation, their memory impact is modest. So learn to think of these together.
 
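The defining property in this hunk -- a map-only operation handles each record in isolation, so it needs no reduce phase -- can be sketched in a few lines of Python (an illustrative sketch, not code from the book): however you partition the input, applying the per-record function to each partition and concatenating gives the same output.

```python
def map_only(records, fn):
    """Apply fn to each record independently; fn returns zero or more outputs."""
    return [out for rec in records for out in fn(rec)]

# A FILTER-like per-record function: emit the record, or emit nothing.
keep_long = lambda rec: [rec] if len(rec) > 3 else []

records = ["bat", "ball", "mitt", "cap"]
whole = map_only(records, keep_long)
# Split the input into two "mapper" partitions; no coordination is needed.
split = map_only(records[:2], keep_long) + map_only(records[2:], keep_long)
assert whole == split == ["ball", "mitt"]
```

Because the result is insensitive to partitioning, Hadoop can run any number of mappers over the splits and simply concatenate their outputs, which is exactly why these operations compose onto the end of a preceding map or reduce for free.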
@@ -70,7 +70,7 @@ Blocks like the following will show up after each of the patterns or groups of p
 - Programmers take note: `AND`, `OR` -- not `&&`, `||`.
 * _Output Count_ -- (_How many records in the output: fewer, same, more, explosively more?_) Zero to 100% of the input record count. Data size will decrease accordingly
 * _Records_ -- (_A sketch of what the records coming out of this operation look like_) Identical to input
-* _Data Flow_ -- (_The Hadoop jobs this operation gives rise to. In this chapter, all the lines will look like this one; in the next chapters that will change_) Pipelinable: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
+* _Data Flow_ -- (_The Hadoop jobs this operation gives rise to. In this chapter, all the lines will look like this one; in the next chapters that will change_) Map-Only: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
 * _Exercises for You_ -- (_A mission to carry forward, if you choose. Don't go looking for an answer section -- we haven't done any of them. In many cases you'll be the first to find the answer._) Play around with `null`s and the conditional operators until you have a good sense of their quirks.
 * _See Also_ -- (_Besides the patterns in its section of the book, what other topics might apply if you're considering this one? Sometimes this is another section in the book, sometimes it's a pointer elsewhere_) The Distinct operations, some Set operations, and some Joins are also used to eliminate records according to some criteria. See especially the Semi-Join and Anti-Join (REF), which select or reject matches against a large list of keys.
 
@@ -123,7 +123,7 @@ NOTE: Sadly, the Nobel Prize-winning physicists Gerard 't Hooft, Louis-Victor Pi
 - You're far better off learning one extra thing to do with a regular expression than most of the other string conditional functions Pig offers.
 - ... and enough other Importants to Know that we made a sidebar of them (REF).
 * _Records_ -- You can use this in a filter clause but also anywhere else an expression is permitted, like the preceding snippet
-* _Data Flow_ -- Pipelinable: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
+* _Data Flow_ -- Map-Only: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
 * _Exercises for You_ -- Follow the http://regexp.info/tutorial.html[regexp.info tutorial], but _only up to the part on Grouping & Capturing_. The rest you are far better off picking up once you find you need it.
 * _See Also_ -- The Pig `REGEX_EXTRACT` and http://pig.apache.org/docs/r0.12.0/func.html#replace[`REPLACE`] functions. Java's http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#sum[Regular Expression] documentation for details on its peccadilloes (but not for an education about regular expressions).
 
@@ -152,7 +152,7 @@ The general case is handled by using a join, as described in the next chapter (R
 * _Hello, SQL Users_ -- This isn't anywhere near as powerful as SQL's `IN` expression. Most importantly, you can't supply another table as the list.
 * _Important to Know_ -- A regular expression alternation is often the right choice instead.
 * _Output Count_ -- As many records as the cardinality of its key, i.e. the number of distinct values. Data size should decrease greatly.
-* _Data Flow_ -- Pipelinable: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
+* _Data Flow_ -- Map-Only: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
 
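The "Important to Know" bullet above says a regular expression alternation often substitutes for an `IN`-style list. A small Python sketch of the equivalence (the team codes here are made-up sample values, not the book's data): filtering by set membership and by an anchored alternation select exactly the same records.

```python
import re

# Hypothetical sample keys and a small fixed list to match against.
teams = ["BOS", "NYA", "CHA", "BOS"]
wanted = {"BOS", "NYA"}

# IN-list style: explicit membership test.
by_membership = [t for t in teams if t in wanted]

# Regex alternation style: one anchored pattern covering the whole list.
pat = re.compile(r"^(BOS|NYA)$")
by_regex = [t for t in teams if pat.match(t)]

assert by_membership == by_regex == ["BOS", "NYA", "BOS"]
```

The alternation form shines when the "list" is really a family of patterns (prefixes, optional suffixes) rather than exact strings; for matching against a genuinely large table of keys, a join is the right tool, as the hunk header notes.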
 === Project Only Chosen Columns by Name
 
@@ -194,7 +194,7 @@ The first projection puts the `home_team_id` into the team slot, renaming it `te
 * _Important to Know_ -- As you can see, we take a lot of care visually aligning subexpressions within the code snippets. That's not because we've tidied up the house for students coming over -- this is what the code we write and the code our teammates expect us to write looks like.
 * _Output Count_ -- Exactly the same as the input.
 * _Records_ -- However you define them to be
-* _Data Flow_ -- Pipelinable: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
+* _Data Flow_ -- Map-Only: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
 * _See Also_ -- "Assembling Literals with Complex Type" (REF)
 
 ==== Extracting a Random Sample of Records
@@ -219,7 +219,7 @@ Experienced software developers will reach for a "seeding" function -- such as R
 - The DataFu package has UDFs for sampling with replacement and other advanced features.
 * _Output Count_ -- Determined by the sampling fraction. As a rule of thumb, variances of things are square-root-ish; expect the size of a 10% sample to be in the 7%-13% range.
 * _Records_ -- Identical to the input
-* _Data Flow_ -- Pipelinable: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
+* _Data Flow_ -- Map-Only: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
 * _Exercises for You_ -- Modify Pig's SAMPLE function to accept a seed parameter, and submit that patch back to the open-source project. This is a bit harder to do than it seems: sampling is key to efficient sorting, and so the code to sample data is intertwingled with a lot of core functionality.
 
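The "square-root-ish" rule of thumb in the Output Count bullet above follows from treating each record's inclusion as an independent coin flip: the kept count is binomial, with standard deviation sqrt(n·p·(1−p)). A sketch of the arithmetic (the n=100 figure is an illustrative choice, not from the book):

```python
import math

def sample_size_band(n, p):
    """Expected sample fraction and a one-standard-deviation band around it,
    for n records each kept independently with probability p (binomial count)."""
    sd = math.sqrt(n * p * (1 - p))  # std deviation of the kept-record count
    return (p - sd / n, p, p + sd / n)

lo, mid, hi = sample_size_band(100, 0.10)
# For 100 records at a 10% sample, sd is 3 records, giving roughly
# the 7%-13% one-sigma band quoted in the bullet above.
assert round(lo, 2) == 0.07 and round(hi, 2) == 0.13
```

Note the band tightens as n grows (the sd grows like sqrt(n) while the count grows like n), so for large datasets the realized fraction hugs the nominal one.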
 ==== Extracting a Consistent Sample of Records by Key
@@ -242,7 +242,7 @@ We called this a terrible hash function, but it does fit the bill. When applied
 - If you'll be spending a bunch of time with a data set, using any kind of random sample to prepare your development sample might be a stupid idea. You'll notice that Red Sox players show up a lot of times in our examples -- that's because our development samples are "seasons by Red Sox players" and "seasons from 2000-2010", which lets us make good friends with the data.
 * _Output Count_ -- Determined by the sampling fraction. As a rule of thumb, variances of things are square-root-ish; expect the size of a 10% sample to be in the 7%-13% range.
 * _Records_ -- Identical to the input
-* _Data Flow_ -- Pipelinable: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
+* _Data Flow_ -- Map-Only: it's composed onto the end of the preceding map or reduce, and if it stands alone becomes a map-only job.
 
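The consistent-sample-by-key idea in this hunk -- hash the key and keep records whose hash lands in a fixed band -- can be sketched in Python (an illustrative sketch using CRC32; the hash function, player IDs, and fraction are assumptions, not the book's code). The payoff is determinism: the same keys are selected on every run and in every dataset you sample, so your development samples all agree on which players they contain.

```python
import zlib

def keep(key, fraction=0.10):
    """Keep a record iff its key hashes into the chosen band.
    Deterministic: the same key gives the same answer every time."""
    return (zlib.crc32(key.encode()) % 100) < int(fraction * 100)

# Hypothetical player-ID keys for illustration.
players = ["pedrodu01", "ortizda01", "youklke01", "ramirma02"]
sample_a = [p for p in players if keep(p)]
sample_b = [p for p in players if keep(p)]
assert sample_a == sample_b  # the sample is reproducible across runs
```

Contrast with `SAMPLE`-style random sampling, where two runs (or two related datasets) select unrelated record sets; hashing the key makes the selection a pure function of the key, which is what lets a seasons sample and a players sample stay aligned.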
 ==== Sampling Carelessly by Only Loading Some `part-` Files
 