Kyle Marcus edited this page Mar 17, 2014 · 3 revisions

The emulator can be run using the MapReduce programming model. One popular implementation of MapReduce is Hadoop which can be set up on the CCR cluster to run part of the emulator code.

Note that Rohit found some limitations to Hadoop, so the emulator construction itself was not done using Hadoop. It was done as a batch process on the cluster, and the output of the emulator was saved into text files.

When running Hadoop, Rohit used Hadoop Streaming, which allows key/value pairs to be read from stdin and written to stdout. The key is written first, followed by a tab character; anything before the first tab is the key, and anything after it is the value.

For example:

foo\tbar

foo is the key and bar is the value ('\t' is the tab character).
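The Streaming convention above can be sketched in a few lines of Python. This is a generic illustration, not one of Rohit's actual scripts; it splits each record at the first tab, so values may themselves contain tabs:

```python
import sys

def parse_line(line):
    """Split a Hadoop Streaming record at the FIRST tab:
    everything before it is the key, everything after is the value."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value

if __name__ == "__main__":
    # Echo each record back out in key<TAB>value form.
    for line in sys.stdin:
        key, value = parse_line(line)
        print(f"{key}\t{value}")
```

A record like `foo\tbar\tbaz` would parse to key `foo` and value `bar\tbaz`, since only the first tab separates key from value.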

Getting emulator output for reduce input

The emulator was not able to run with Hadoop because only two map processes were allowed per node. So, for each downsample file, emulator.py is run and its output is piped to newEmulator. newEmulator writes its output to panasas with the following information:

resample_phmid sample mean

From this output use the "resample_phmid" as the key and "sample mean" as the value for the reduce operations.

This whole procedure was done as a batch process on the cluster.

Reduce:

There are a couple of slurm scripts in Rohit's my_hadoop directory; from looking at them, it appears that my_slurm_script is called first and my_slurm_script_for_final_reduce is called second. Note that there are two reduce procedures (both use the same map function).

my_slurm_script uses map6.py as the mapper, which reads the newEmulator output files and uses "resample_phmid" as the key and "sample mean" as the value. The reducer is reduce3.py, which writes its output to panasas as input for the second reduce process.
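map6.py is not reproduced here, but a minimal mapper with the behavior described above might look like the following sketch. It assumes each newEmulator output line is whitespace-separated as `resample_phmid sample_mean` (per the format shown earlier) and re-emits it as a tab-separated Streaming pair:

```python
import sys

def map_record(line):
    """Convert one newEmulator output line ('resample_phmid sample_mean')
    into a tab-separated key/value record for Hadoop Streaming.
    Illustrative sketch only -- the real map6.py may differ."""
    phmid, sample_mean = line.split()
    return f"{phmid}\t{sample_mean}"

if __name__ == "__main__":
    for line in sys.stdin:
        line = line.strip()
        if line:  # skip blank lines
            print(map_record(line))
```

Hadoop Streaming then sorts these records by key before handing them to the reducer, so all sample means for a given resample_phmid arrive together.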

After the first reduce process runs, my_slurm_script_for_final_reduce is called as the second reduce process, again with map6.py as the mapper. It reads the output files from the previous Hadoop run and breaks the input into key/value pairs. It then uses reduce5.py as the reducer, which outputs the final format as:

phmid weight
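The actual weight computation in reduce5.py is not documented here, but the shape of a Streaming reducer that produces one `phmid weight` line per key can be sketched as follows. For illustration only, this sketch averages the values for each phmid; the real reducer likely computes something different:

```python
import sys
from itertools import groupby

def reduce_group(key, values):
    """Aggregate all values for one phmid into a single output record.
    Averaging is a placeholder -- the real reduce5.py weight formula
    is not known from this page."""
    vals = [float(v) for v in values]
    return f"{key}\t{sum(vals) / len(vals)}"

def main():
    # Streaming delivers records sorted by key, so groupby sees each
    # key's values as one contiguous run.
    pairs = (line.rstrip("\n").partition("\t") for line in sys.stdin)
    for key, group in groupby(pairs, key=lambda p: p[0]):
        print(reduce_group(key, (v for _, _, v in group)))

if __name__ == "__main__":
    main()
```

The grouping relies on Hadoop's sort phase: because all records for a key are adjacent on stdin, the reducer never needs to hold more than one key's values in memory.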
