
Commit 2852f5b
Author: Philip (flip) Kromer
Message: feedback distilled
Parent: a577037

File tree: 3 files changed (+18, -25 lines)

02-feedback_and_response.asciidoc

Lines changed: 8 additions & 0 deletions
@@ -1,6 +1,14 @@
 ==== Introduction Structure
 
+_Here is the new introduction to Chapter Two, "Hadoop Basics". Does it hit the mark?_
+
+In this chapter, we will equip you with two things: the necessary mechanics of working with Hadoop, and a physical intuition for how data and computation move around the cluster during a job.
+
+Hadoop is a large and complex beast. It can be bewildering even to begin using the system, so in this chapter we're going to purposefully charge through the least you need to know to launch jobs and manage data. If you hit trouble, anything past that is well covered in Hadoop's excellent, detailed documentation or online. But don't go looking for trouble! For every one of its many modes, options, and configurations that is essential, there are many more that are distracting or even dangerous. The most important optimizations you can make come from designing efficient workflows, and even more so from knowing when to spend highly valuable programmer time to reduce compute time.
+
+The key to doing so is an intuitive, physical understanding of how data moves around a Hadoop cluster. Shipping data from one machine to another -- even from one location on disk to another -- is outrageously costly, and in the vast majority of cases dominates the cost of your job. Using a physical analogy, and by following an example job through its full lifecycle, we'll describe at a high level how Hadoop organizes data and assigns tasks across compute nodes so that as little data as possible is set in motion. More importantly, we'll show you how to read a job's Hadoop dashboard to understand how much it cost and why. We strongly urge you to gain access to an actual Hadoop cluster (Appendix X (REF) can help) and run jobs. Your goal for this chapter is to take away a basic understanding of how Hadoop distributes tasks and data, and the ability to run a job and see what's going on with it. As you run more and more jobs through the remainder of the book, it is the latter ability that will cement your intuition.
+
+Let's kick things off by making friends with the good folks at Elephant and Chimpanzee, Inc. Their story should give you an essential physical understanding of the problems Hadoop addresses and how it solves them.
 
 ==== Tell readers what the point of this is before you dive into the example. What are you showing them? Why? What will they get out of it? "I'm going to walk you through an example of ___, which will show you _____ so that you'll begin to understand how _____" for example.

02-hadoop_basics.asciidoc

Lines changed: 9 additions & 24 deletions
@@ -7,30 +7,9 @@ In this chapter, we will equip you with two things: the necessary mechanics of w
 Hadoop is a large and complex beast. It can be bewildering even to begin using the system, so in this chapter we're going to purposefully charge through the least you need to know to launch jobs and manage data. If you hit trouble, anything past that is well covered in Hadoop's excellent, detailed documentation or online. But don't go looking for trouble! For every one of its many modes, options, and configurations that is essential, there are many more that are distracting or even dangerous. The most important optimizations you can make come from designing efficient workflows, and even more so from knowing when to spend highly valuable programmer time to reduce compute time.
 
-The key to doing so is an intuitive, physical understanding of how data moves around a Hadoop cluster.
+The key to doing so is an intuitive, physical understanding of how data moves around a Hadoop cluster. Shipping data from one machine to another -- even from one location on disk to another -- is outrageously costly, and in the vast majority of cases dominates the cost of your job. Using a physical analogy, and by following an example job through its full lifecycle, we'll describe at a high level how Hadoop organizes data and assigns tasks across compute nodes so that as little data as possible is set in motion. More importantly, we'll show you how to read a job's Hadoop dashboard to understand how much it cost and why. We strongly urge you to gain access to an actual Hadoop cluster (Appendix X (REF) can help) and run jobs. Your goal for this chapter is to take away a basic understanding of how Hadoop distributes tasks and data, and the ability to run a job and see what's going on with it. As you run more and more jobs through the remainder of the book, it is the latter ability that will cement your intuition.
 
-how data moves around a hadoop cluster
-how much that costs
-
-The focus of this chapter is on building your intuition on
-how much data should be processed and how much that should cost
-how much data was processed and how much it did cost.
-
-how and why Hadoop distributes data across the machines in a cluster
-how much it costs to
-overhead
-
-basis for comparing human costs to cluster costs
-
-How much data was moved
-
-This chapter will only look at "embarras
-
-Shipping data from one machine to another -- even from one location on disk to another -- is outrageously costly
-How
-
-// (If you're already familiar with the basics of using Hadoop and are too anxious to get to the specifics of working with data, skip ahead to Chapter 4)
+Let's kick things off by making friends with the good folks at Elephant and Chimpanzee, Inc. Their story should give you an essential physical understanding of the problems Hadoop addresses and how it solves them.
 
 .Chimpanzee and Elephant Start a Business
 ******
@@ -49,7 +28,7 @@ The fact that each chimpanzee's work is independent of any other's -- no interof
 === Map-only Jobs: Process Records Individually ===
 
-As you'd guess, the way Chimpanzee and Elephant organize their files and workflow corresponds directly with how Hadoop handles data and computation under the hood. We can now use it to walk you through some examples.
+As you'd guess, the way Chimpanzee and Elephant organize their files and workflow corresponds directly with how Hadoop handles data and computation under the hood. We can now use it to walk you through an example in detail.
 
 We may not be as clever as JT's multilingual chimpanzees, but even we can translate text into a language we'll call _Igpay Atinlay_.footnote:[Sharp-eyed readers will note that this language is really called _Pig Latin_. That term has another name in the Hadoop universe, though, so we've chosen to call it Igpay Atinlay -- Pig Latin for "Pig Latin".] For the unfamiliar, here's how to http://en.wikipedia.org/wiki/Pig_latin#Rules[translate standard English into Igpay Atinlay]:
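As a sketch of the record-by-record transformation such a map-only job performs, here is a minimal Igpay Atinlay translator in Python. This is our own illustration, not the book's code: the function names and the simplified consonant-cluster rule (vowel-initial words get "way"; "y" and punctuation handling are glossed over) are assumptions for the example.

```python
import re

# Leading consonant cluster, then the rest of the word.
CONSONANTS = re.compile(r"([bcdfghjklmnpqrstvwxz]+)(\w*)", re.I)

def igpay(word):
    # Move a leading consonant cluster to the end and append "ay";
    # vowel-initial words just get "way".
    m = CONSONANTS.match(word)
    if m and m.group(1):
        return m.group(2) + m.group(1) + "ay"
    return word + "way"

def mapper(line):
    # A map-only job applies this independently to every record (line),
    # which is why no data needs to move between machines.
    return " ".join(igpay(w) for w in line.split())

print(mapper("hadoop is elegant"))   # -> adoophay isway elegantway
```

Because each line is translated with no reference to any other line, this is exactly the "embarrassingly parallel" shape that lets Hadoop run the mapper wherever the data already sits.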
 
@@ -260,7 +239,13 @@ The one important detail to learn in all this is that _task trackers do not run
 // === The Cost of a Job
 
+// So one of the key concepts behind Map/Reduce is the idea of "moving the compute to the data". Hadoop stores data locally across multiple
+//
+// The focus of this chapter is on building your intuition on
+// how much data should be processed and how much that should cost
+// how much data was processed and how much it did cost.
 
 // === Outro
 //
 // In the next chapter, you'll learn about map/reduce jobs -- the full power of Hadoop's processing paradigm. Let's start by joining JT and Nannette with their next client.

book.asciidoc

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 = Big Data for Chimps
 
-include::11a-geodata-intro.asciidoc[]
+include::02-feedback_and_response.asciidoc[]
