
Commit c2959a8

Author: Philip (flip) Kromer

    minor revisions including a new opening that will actually work hooray

1 parent: 4aef474
9 files changed: +255 −68 lines

00-outlines.asciidoc

Lines changed: 74 additions & 64 deletions
@@ -1,65 +1,9 @@
 
- 5. **Pipelineable** Additions/Improvements
-     - Controlling Case Sensitivity in String Comparisons (`ff = FILTER fr BY EqualsIgnoreCase(franch_id, 'bOs'); DUMP ff;`)
-     - Select Records Using a List of Values
-     - very small inline list with the `CASE` statement -- `CASE X WHEN val1 ... WHEN val2 ... ELSE .. END` and `CASE WHEN cond .. WHEN cond .. ELSE .. END`
-     - Transforming Strings with Regular Expressions
-     - Transforming Nulls into Real Values
-     - Converting a Number to its String Representation (and Back) (cast with (int))
-     - Embedding Quotes and Special Characters Within String Literals
-     - JSON encoding/decoding on a value (vs on a record)
-     - Assigning a Unique Identifier to Each Record (use `-tagPath` when loading; may require most recent Pig)
-     - `$0` references; `*` and `..` references
-
-     - Flattening a tuple gives you columns; flattening a bag gives you rows
-     - Flattening bags == cross product
-     - Transposing Columns into Records (make the inline bag from several fields, then flatten it)
-     - Converting an Internally-Delimited Field into Multiple Columns Using STRSPLIT
-     - Converting an Internally-Delimited Field into Multiple Rows Using STRSPLITBAG
-     - Exploding a String into its Characters
-     - demonstrate case and ternary statements (combine/move demonstration in filter section?)
-
- 6. **Grouping** Additions/Improvements
-     - JSON-encoded string
-     - completely summarizing
-
- 7. **Joining** Additions/Improvements
-     - Replicated join
-     - stuff in "important notes about joins"
-
- 8. **Ordering and Uniquing**
-     - Demonstrate Sort in map/reduce
-     - max with/without ties, with/without record
-     - top-k with/without ties, with/without record
-     - running min/max
-     - mode (make an exercise)
-     - cardinality, i.e. count of distinct values
-
- 9. **Advanced Patterns**
-     - Better COGROUP
-     - Stitch and Over
-     - multi-join
-     - master-detail
-     - z-score
-     - group/decorate/flatten
-     - group/flatten/re-flatten
-     - cube & rollup
-     - run expectancy (prediction)
-
  10. **Event log**
-     - Parsing logs and using regular expressions
-     - lead and lag
      - geo IP via range query
      - sessionizing, user paths
      - abusing a webserver for testing
-     - Histograms and time series of pageviews
-     - Anomaly detection on Wikipedia Pageviews
-     - windowing and rolling statistics
-     - correlation of joint timeseries
-     - Holt-Winters
-     - Correlations
+     - One more topic from the chopping block, below
 
  11. **Geo Analysis**
      - quad keys for point density heat map
@@ -72,8 +16,6 @@
      - joining stadiums onto quads
      - breaking voronoi regions into multi-scale quads
      - map weather observations to cells, average
-     - spatial join of points and multi-scale quads
-     - spatial join of quads on quads ("range" query)
 
  12. **Text Analysis**
      - grep'ing etc for simple matches
@@ -83,14 +25,25 @@
      - group decorate flatten to get rates
      - good turing to knock back
      - pointwise mutual information to see words
+
+ 10. **Event log** (Chopping Block topics)
+     - Parsing logs and using regular expressions
+     - lead and lag; windowing and rolling statistics
+     - Histograms and time series of pageviews
+     - Anomaly detection on Wikipedia Pageviews
+     - correlation of joint timeseries
+     - Holt-Winters
+     - Correlations
+
+ 11. **Geo Analysis** (Chopping Block)
+     - spatial join of points and multi-scale quads
+     - spatial join of quads on quads ("range" query)
+
+ 12. **Text Analysis** (Chopping Block topics)
      - Minhashing to combat a massive feature space
      - How to cheat with Bloom filters
 
- 13. **Data Munging (Semi-Structured Data)**
-     - Wikipedia for character encoding
-     - airports for reconciliation
-     - weather: parsing flat pack file
 
  14. **Statistics**
      - subsetting / sampling your data: consistent sampling, distributions, replacement
@@ -112,6 +65,11 @@
      - tuning
      - why algebraic UDFs are awesome and how to be algebraic
 
+ 13. **Data Munging (Semi-Structured Data)**
+     - Wikipedia for character encoding
+     - airports for reconciliation
+     - weather: parsing flat pack file
+
  21. *Hadoop Internals*
      - What happens when a job is launched
      - A shallow dive into the HDFS
@@ -121,6 +79,58 @@
      - Tuning for the Brave and Foolish
      - The USE Method
  23. **Data Modeling for HBase-style Database**
+
+ === Chopping Block
+
+ 5. **Pipelineable** Additions/Improvements
+     - Controlling Case Sensitivity in String Comparisons (`ff = FILTER fr BY EqualsIgnoreCase(franch_id, 'bOs'); DUMP ff;`)
+     - Select Records Using a List of Values
+     - very small inline list with the `CASE` statement -- `CASE X WHEN val1 ... WHEN val2 ... ELSE .. END` and `CASE WHEN cond .. WHEN cond .. ELSE .. END`
+     - Transforming Strings with Regular Expressions
+     - Transforming Nulls into Real Values
+     - Converting a Number to its String Representation (and Back) (cast with (int))
+     - Embedding Quotes and Special Characters Within String Literals
+     - JSON encoding/decoding on a value (vs on a record)
+     - Assigning a Unique Identifier to Each Record (use `-tagPath` when loading; may require most recent Pig)
+     - `$0` references; `*` and `..` references
+
+     - Flattening a tuple gives you columns; flattening a bag gives you rows
+     - Flattening bags == cross product
+     - Transposing Columns into Records (make the inline bag from several fields, then flatten it)
+     - Converting an Internally-Delimited Field into Multiple Columns Using STRSPLIT
+     - Converting an Internally-Delimited Field into Multiple Rows Using STRSPLITBAG
+     - Exploding a String into its Characters
+     - demonstrate case and ternary statements (combine/move demonstration in filter section?)
+
+ 6. **Grouping** Additions/Improvements
+     - JSON-encoded string
+     - completely summarizing
+
+ 7. **Joining** Additions/Improvements
+     - Replicated join
+     - stuff in "important notes about joins"
+
+ 8. **Ordering and Uniquing**
+     - Demonstrate Sort in map/reduce
+     - max with/without ties, with/without record
+     - top-k with/without ties, with/without record
+     - running min/max
+     - mode (make an exercise)
+     - cardinality, i.e. count of distinct values
+
+ 9. **Advanced Patterns**
+     - Better COGROUP
+     - Stitch and Over
+     - multi-join
+     - master-detail
+     - z-score
+     - group/decorate/flatten
+     - group/flatten/re-flatten
+     - cube & rollup
+     - run expectancy (prediction)
 
  27. **Intro to Storm+Trident**
  28. **Machine Learning without Grad School**:
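
The outline items above are terse reminders, so here is a minimal Pig sketch of a few of the named "Pipelineable" patterns: a case-insensitive comparison, a small inline list of values handled with `CASE`, transposing columns into records by flattening an inline bag, and splitting a delimited field with `STRSPLIT`. This sketch is an editorial addition; the file paths, relation names, and field names are invented for illustration and are not the book's datasets.

[source,pig]
----
-- Hypothetical inputs, for illustration only.
games   = LOAD 'games.tsv'   AS (game_id:chararray, franch_id:chararray, attendance:int);
weather = LOAD 'weather.tsv' AS (city:chararray, temp_jan:float, temp_feb:float, temp_mar:float);
people  = LOAD 'people.tsv'  AS (person_id:chararray, name_field:chararray);

-- Controlling case sensitivity in string comparisons (the outline's EqualsIgnoreCase example):
bos_games = FILTER games BY EqualsIgnoreCase(franch_id, 'bOs');

-- Selecting against a very small inline list of values with CASE:
tagged = FOREACH games GENERATE game_id,
  (CASE franch_id WHEN 'BOS' THEN 'of-interest' WHEN 'NYA' THEN 'of-interest' ELSE 'other' END) AS tag;

-- Transposing columns into records: make an inline bag from several fields, then flatten it.
-- Flattening a bag gives you rows, so three monthly columns become three records per city.
by_month = FOREACH weather GENERATE city,
  FLATTEN(TOBAG(('jan', temp_jan), ('feb', temp_feb), ('mar', temp_mar))) AS (month, temp);

-- Converting an internally-delimited field into multiple columns using STRSPLIT
-- (flattening a tuple gives you columns):
name_parts = FOREACH people GENERATE person_id,
  FLATTEN(STRSPLIT(name_field, ',', 2)) AS (last_name, first_name);
----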

01-opening.asciidoc

Lines changed: 83 additions & 1 deletion
@@ -1,5 +1,87 @@
  == Insight comes from Data in Context
 
+ We could start by telling you how awesome and important Big Data is -- that we can now comprehensively measure aspects of the world we couldn't before, and that this lets us extract insight into essential but formerly unquantifiable qualities such as "audience engagement", "unexpected event", "..."
+ But since you're already reading this paragraph, we'll skip the sales pitch.
+ (Later on there are sections on "How to Explain Big Data to your Boss" and "What is Big Data (besides 'A Really Good Marketing Device')?".)
+
+ Instead, let's talk about robots and humans.
+
+ In 1996, Garry Kasparov narrowly defeated the supercomputer Deep Blue. Then, in 1997, Deep Blue won the rematch.
+ So computers have bested humans. Hang it up and go home.
+ The marketing hype says that Big Data gives us unprecedented power, and it's easy to make the mistake of becoming thrall to it -- while we agree that "In God We Trust, All Others Bring Data", it is insufficient to draw only on what can be quantified with sufficient fidelity.
+
+ The power of the big data tools comes from taking away ability to
+
+ .Garry Kasparov, "The Chess Master and the Computer", 2010
+ ________
+ [In 1996] I narrowly defeated the supercomputer Deep Blue in a match. Then, in 1997, IBM redoubled its efforts—and doubled Deep Blue’s processing power—and I lost the rematch in an event that made headlines around the world. The result was met with astonishment and grief by those who took it as a symbol of mankind’s submission before the almighty computer. (“The Brain’s Last Stand” read the Newsweek headline.) Others shrugged their shoulders, surprised that humans could still compete at all against the enormous calculating power that, by 1997, sat on just about every desk in the first world. ... no one understood all the ramifications of having a super-grandmaster on your laptop, especially what this would mean for professional chess.
+
+ There have been many unintended consequences, both positive and negative, of the rapid proliferation of powerful chess software. Kids love computers and take to them naturally, so it’s no surprise that the same is true of the combination of chess and computers. With the introduction of super-powerful software it became possible for a youngster to have a top- level opponent at home instead of needing a professional trainer from an early age. Countries with little by way of chess tradition and few available coaches can now produce prodigies. I am in fact coaching one of them this year, nineteen-year-old Magnus Carlsen, from Norway, where relatively little chess is played.
+
+ The heavy use of computer analysis has pushed the game itself in new directions. The machine doesn’t care about style or patterns or hundreds of years of established theory. It counts up the values of the chess pieces, analyzes a few billion moves, and counts them up again. (A computer translates each piece and each positional factor into a value in order to reduce the game to numbers it can crunch.) It is entirely free of prejudice and doctrine and this has contributed to the development of players who are almost as free of dogma as the machines with which they train. Increasingly, a move isn’t good or bad because it looks that way or because it hasn’t been done that way before. It’s simply good if it works and bad if it doesn’t. Although we still require a strong measure of intuition and logic to play well, humans today are starting to play more like computers.
+
+ The availability of millions of games at one’s fingertips in a database is also making the game’s best players younger and younger. Absorbing the thousands of essential patterns and opening moves used to take many years, a process indicative of Malcolm Gladwell’s “10,000 hours to become an expert” theory as expounded in his recent book Outliers. (Gladwell’s earlier book, Blink, rehashed, if more creatively, much of the cognitive psychology material that is re-rehashed in Chess Metaphors.) Today’s teens, and increasingly pre-teens, can accelerate this process by plugging into a digitized archive of chess information and making full use of the superiority of the young mind to retain it all. In the pre-computer era, teenage grandmasters were rarities and almost always destined to play for the world championship. Bobby Fischer’s 1958 record of attaining the grandmaster title at fifteen was broken only in 1991. It has been broken twenty times since then, with the current record holder, Ukrainian Sergey Karjakin, having claimed the highest title at the nearly absurd age of twelve in 2002. Now twenty, Karjakin is among the world’s best, but like most of his modern wunderkind peers he’s no Fischer, who stood out head and shoulders above his peers—and soon enough above the rest of the chess world as well.
+
+ In what Rasskin-Gutman explains as Moravec’s Paradox, in chess, as in so many things, what computers are good at is where humans are weak, and vice versa. This gave me an idea for an experiment. What if instead of human versus machine we played as partners?
+
+ Having a computer partner also meant never having to worry about making a tactical blunder. The computer could project the consequences of each move we considered, pointing out possible outcomes and countermoves we might otherwise have missed. With that taken care of for us, we could concentrate on strategic planning instead of spending so much time on calculations. Human creativity was even more paramount under these conditions. A month earlier I had defeated the Bulgarian in a match of “regular” rapid chess 4–0. Our advanced chess match ended in a 3–3 draw. My advantage in calculating tactics had been nullified by the machine.
+
+ In 2005, the online chess-playing site Playchess.com hosted what it called a “freestyle” chess tournament in which anyone could compete in teams with other players or computers. ... Several groups of strong grandmasters working with several computers at the same time entered the competition. At first, the results seemed predictable. The teams of human plus machine dominated even the strongest computers. The [top chess machines] were no match for a strong human player using a relatively weak laptop. Human strategic guidance combined with the tactical acuity of a computer was overwhelming.
+
+ The surprise came at the conclusion of the event. The winner was revealed to be not a grandmaster with a state-of-the-art PC but a pair of amateur American chess players using three computers at the same time. Their skill at manipulating and “coaching” their computers to look very deeply into positions effectively counteracted the superior chess understanding of their grandmaster opponents and the greater computational power of other participants. Weak human + machine + better process was superior to a strong computer alone and, more remarkably, superior to a strong human + machine + inferior process. http://www.nybooks.com/articles/archives/2010/feb/11/the-chess-master-and-the-computer/
+ ________
+
+ The goal of this book is that you become just such an expert coach.
+ You don't need to be a grandmaster in statistics.
+ You don't need to be an expert programmer: we favor short, elegant, readable scripts.
+ You don't need to have reached the third dan of dragon-lightning form in databases.
+
+ What you do need is intuition about how data moves around.
+ If you can predict the execution, you know when to invest in improving it and when something funny is going on -- strategic execution.
+ More importantly, you need to know how to turn the measurements you have into the data you need -- how to augment it.
+
+ This book will show you how to coach the computer, how to apply superior process.
+
+ We have a principle: "Robots are cheap, humans are important."
+ (Math about getting soda from the fridge, about running a computer in the cloud.)
+
+ We start by demonstrating the internal mechanics of Hadoop, exactly and only deep enough that you can understand how data moves around.
+ In a Big Data system, motion of data (not CPU) is nearly always the dominant cost of a computation.
+ Memory capacity is nearly always the fundamental constraint of computation.
+
+ One nice thing about big data is that performance estimation is brutally stark -- ...
+ (The not-as-nice thing is that when it is bad, it is impossible.)
+
+ Once you have a physical intuition of what's happening, we move to tactics.
+ We consulted the leading SQL cookbooks to find what patterns of use (and tricks of the trade) decades of practice have defined.
+ Screw "NoSQL": throwing out the old lore is always a bad plan.
+
+ // four levels: explain, optimize, predict, control (operations research blog)
+
+ Tracking every path your delivery trucks take helps the fleet improve fuel usage, safety for the driver and the rest of us, operating efficiency, and costs.
+
  // IMPROVEME: put in an interlude that is JT & Nanette meeting. (Told as a flashforward.)
 
  Data is worthless. Actually, it's worse than worthless: it requires money and effort to collect, store, transport and organize. Nobody wants data.
@@ -36,7 +118,7 @@ This does _not_ follow the accepted path to truth, namely the Scientific Method.
 
  This new path to truth is what Peter Norvig (Google's Director of Research) calls "http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/35179.pdf[The unreasonable effectiveness of data]". You don't have to start with a model and you don't necessarily end up with a model. There's no simplification of the universe down to a smaller explanation you can carry forward. Sure, we can apply domain knowledge and say that the correspondence of Lynyrd Skynyrd with Taxidermy means the robots have captured the notion of "Southern-ness". But for applying the result in practice, there's no reason to do so. The algorithms have replaced a uselessly complicated thing (the trillions of associations possible from interest to product category) with an _actionably_ complicated thing (a scoring of what categories to probabilistically present based on interest). You haven't confirmed a falsifiable hypothesis. But you can win at the track.
 
- The proposition that the Unreasonably-Effective Method is a worthwhile rival to the Scientific Method is sure to cause barroom brawls at scientific conferences for years to come. This book will not go deeply into advanced algorithms, but we will repeatedly see examples of Unreasonable Effectiveness, as the data comes forth with patterns of its own.
+ The proposition that the Unreasonably-Effective Method is a worthwhile rival to the Scientific Method is sure to cause barroom brawls at scientific conferences for years to come. This book will not go deeply into advanced algorithms, but we will repeatedly see examples of Unreasonable Effectiveness, as the data comes forth with patterns of its own.
 
  === The Answer to the Crisis

03-map_reduce.asciidoc

Lines changed: 1 addition & 0 deletions
@@ -412,6 +412,7 @@ This means that:
  * `apple` and `zoo` come before `шимпанзе`, because the basic ASCII-like characters (like the ones on a US keyboard) precede extended unicode-like characters (like the russian characters in the word for "chimpanzee").
  * `###` (hash marks) come before `Apple` and `zoo`; and `||||` (pipes) come after all of them. Remember these characters -- they are useful for forcing a set of records to the top or bottom of your input, a trick we'll use in the geodata chapter (REF). The dot (`.`), hyphen (`-`), plus (`+`) and hash (`#`) come near the start of the 7-bit ASCII alphanumeric set. The tilde (`~`) and pipe (`|`) come at the end. All of them precede extended-character words like `шимпанзе`.
 
+ .Beware the Derp-Sort
  NOTE: It's very important to recognize that _numbers are not sorted by their numeric value unless you have control over their Java type_. The simplest way to get numeric sorting of positive numbers is to pad numeric outputs to a constant width by prepending spaces. In Ruby, the expression `"%10d" % val` produces a ten-character wide string (wide enough for all positive thirty-two bit numbers). There's no good way in basic Hadoop Streaming to get negative numbers to sort properly -- yes, this is very annoying. (TECHREVIEW: is there a good way?)
 
  In the common case, the partition key, group key and sort key are the same, because all you care about is that records are grouped. But of course it's also common to have the three keys not be the same. The prior example (REF), a JOIN of two tables, demonstrated a common pattern for use of the secondary sort; and the roll-up aggregation example that follows illustrates both a secondary sort and a larger partition key than group key.
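
The note above concerns raw Hadoop Streaming, where every key is compared as a string of bytes. As an editorial aside (not from the original chapter), the same derp-sort pitfall shows up in Pig, where the declared schema type decides whether `ORDER BY` compares lexically or numerically; the file path and field name below are invented for illustration.

[source,pig]
----
-- Load the same column twice: once as a string, once as an integer.
nums_str = LOAD 'numbers.tsv' AS (val:chararray);
nums_int = LOAD 'numbers.tsv' AS (val:int);

derp = ORDER nums_str BY val;  -- lexicographic order: 1, 10, 100, 2, 20, 9, ...
sane = ORDER nums_int BY val;  -- numeric order:       1, 2, 9, 10, 20, 100, ...
----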

06-analytic_patterns-structural_operations-grouping.asciidoc

Lines changed: 1 addition & 0 deletions
@@ -1360,6 +1360,7 @@ never_sox = FOREACH player_soxness_g GENERATE group AS player_id;
  * _Records_ -- List of keys
  * _Data Flow_ -- Map, Combiner & Reducer. Combiners should be extremely effective.
 
+
  === Refs
 
  * http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0057753[Born at the Wrong Time: Selection Bias in the NHL Draft] by Robert O. Deaner, Aaron Lowen, Stephen Cobley. February 27, 2013. DOI: 10.1371/journal.pone.0057753
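
For orientation, the hunk header above carries the context line `never_sox = FOREACH player_soxness_g GENERATE group AS player_id;`, which belongs to a grouping pattern (players who never played for the Red Sox). The setup of `player_soxness_g` is not shown in this diff, so the following is only a plausible reconstruction of that pattern with an invented input file, not the chapter's actual code.

[source,pig]
----
-- Hypothetical reconstruction: cogroup each player's seasons against their Red Sox seasons,
-- then keep only players whose Red Sox bag is empty.
bat_seasons = LOAD 'bat_seasons.tsv' AS (player_id:chararray, team_id:chararray, year_id:int);
sox_seasons = FILTER bat_seasons BY team_id == 'BOS';

player_soxness_g = COGROUP bat_seasons BY player_id, sox_seasons BY player_id;
never_sox_g      = FILTER player_soxness_g BY IsEmpty(sox_seasons);
never_sox        = FOREACH never_sox_g GENERATE group AS player_id;
----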
