-
Command:
hadoop fs -mkdir uks8451-bd23
-
Extract the input files into the hw1.2 directory
-
Commands:
unzip hw1.2.zip
hadoop fs -mkdir hw1.2
hadoop fs -put hw1.2
hadoop fs -ls hw1.2
-
Input: text data
-
Mapper:
python n_gram_count_mapper.py <n> <min_words>
n => 1 for unigram
min_words => 3 (minimum number of words a line must have to be considered valid)
Outputs: <uni_gram><tab>1
-
Reducer:
python n_gram_count_reducer.py
-
Outputs:
<uni_gram><tab><count>
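For reference, a minimal sketch of what the two streaming scripts might look like (the graded scripts are not reproduced here; the lowercasing and whitespace tokenization are assumptions):

#!/usr/bin/env python
# n_gram_count_mapper.py <n> <min_words> -- hypothetical sketch
import sys

n = int(sys.argv[1])          # n-gram size: 1 = unigram, 2 = bigram, ...
min_words = int(sys.argv[2])  # skip lines with fewer words than this

for line in sys.stdin:
    words = line.strip().lower().split()
    if len(words) < min_words:
        continue
    for i in range(len(words) - n + 1):
        # emit each n-gram once with a count of 1
        print('%s\t1' % ' '.join(words[i:i + n]))

#!/usr/bin/env python
# n_gram_count_reducer.py -- hypothetical sketch
# Relies on the shuffle phase delivering records grouped and sorted by key.
import sys

current, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t')
    if key != current:
        if current is not None:
            print('%s\t%d' % (current, count))
        current, count = key, 0
    count += int(value)
if current is not None:
    print('%s\t%d' % (current, count))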
-
Command:
hadoop fs -rm -r 1_grams; \
mapred streaming -file n_gram_count_mapper.py \
  -file n_gram_count_reducer.py \
  -input hw1.2/* \
  -output 1_grams \
  -mapper "python n_gram_count_mapper.py 1 3" \
  -reducer "python n_gram_count_reducer.py"
- Input: Unigram Frequencies
- Mapper:
cat
- Combiner:
python n_gram_sum_reducer.py
- Reducer:
python n_gram_sum_reducer.py
- Outputs:
sum<tab><total_count>
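A minimal sketch of n_gram_sum_reducer.py (hypothetical, not the graded script). Because it only reads the last tab-separated field, the same script works as the combiner (summing raw <n_gram><tab><count> records) and as the reducer (summing the partial sum<tab><count> records the combiners emit):

#!/usr/bin/env python
# n_gram_sum_reducer.py -- hypothetical sketch
import sys

total = 0
for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    total += int(fields[-1])  # the count is always the last field
print('sum\t%d' % total)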
- Command:
hadoop fs -rm -r 1_grams_sum; \
mapred streaming -file n_gram_count_mapper.py \
  -file n_gram_sum_reducer.py \
  -input 1_grams/* \
  -output 1_grams_sum \
  -mapper "cat" \
  -combiner "python n_gram_sum_reducer.py" \
  -reducer "python n_gram_sum_reducer.py"
- Screenshot
-
Input: Unigram Frequencies
-
Mapper:
python uni_gram_prob_mapper.py <unigram_total_sum>
unigram_total_sum = $(hadoop fs -cat 1_grams_sum/* | cut -f 2)
-
Reducer:
cat
-
Outputs:
<uni_gram><tab><uni_gram_prob>
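A minimal sketch of uni_gram_prob_mapper.py, assuming it simply divides each count by the grand total passed as its only argument:

#!/usr/bin/env python
# uni_gram_prob_mapper.py <unigram_total_sum> -- hypothetical sketch
import sys

total = float(sys.argv[1])  # grand total computed by the 1_grams_sum job

for line in sys.stdin:
    unigram, count = line.rstrip('\n').split('\t')
    # relative frequency of this unigram in the corpus
    print('%s\t%.10f' % (unigram, int(count) / total))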
-
Command:
hadoop fs -rm -r 1_grams_prob; \
mapred streaming -file uni_gram_prob_mapper.py \
  -input 1_grams/* \
  -output 1_grams_prob \
  -mapper "python uni_gram_prob_mapper.py $(hadoop fs -cat 1_grams_sum/* | cut -f 2)" \
  -reducer "cat"
-
Input: text data
-
Mapper:
python n_gram_count_mapper.py <n> <min_words>
n => 2 for bigram
min_words => 3 (minimum number of words a line must have to be considered valid)
Outputs: <bi_gram><tab>1
-
Reducer:
python n_gram_count_reducer.py
-
Outputs:
<bi_gram><tab><count>
-
Command:
hadoop fs -rm -r 2_grams; \
mapred streaming -file n_gram_count_mapper.py \
  -file n_gram_count_reducer.py \
  -input hw1.2/* \
  -output 2_grams \
  -mapper "python n_gram_count_mapper.py 2 3" \
  -reducer "python n_gram_count_reducer.py"
- Input: Bigram Frequencies
- Mapper:
cat
- Combiner:
python n_gram_sum_reducer.py
- Reducer:
python n_gram_sum_reducer.py
- Outputs:
sum<tab><total_count>
- Command:
hadoop fs -rm -r 2_grams_sum; \
mapred streaming -file n_gram_count_mapper.py \
  -file n_gram_sum_reducer.py \
  -input 2_grams/* \
  -output 2_grams_sum \
  -mapper "cat" \
  -combiner "python n_gram_sum_reducer.py" \
  -reducer "python n_gram_sum_reducer.py"
- Screenshot
-
Input: Unigram Probabilities, Bigram Frequencies
-
Mapper:
python n_gram_prob_mapper.py 2 <bigram_total_sum>
bigram_total_sum = $(hadoop fs -cat 2_grams_sum/* | cut -f 2)
Outputs:
For every line from the bigram counts: <word1><tab><word2><tab><bi_gram_frequency/bigram_total_sum>
For every line from the unigram probabilities: <unigram><tab>-<unigram_probability>
(the leading "-" makes each unigram record sort ahead of the bigram records that share its first word)
-
Reducer:
python n_gram_prob_reducer.py
-
Other command options:
- -D stream.num.map.output.key.fields=2
  Specify that the key is made up of the first two tab-separated fields
- -D mapred.text.key.partitioner.options="-k1,1"
  Partition on the first field only, so every bigram and its corresponding unigram reach the same reducer
- -D mapred.text.key.comparator.options="-k1,2"
  Sort reducer input on both key fields, so each unigram precedes its respective bigrams
- -partitioner "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner"
  Use the key-field-based partitioner so the partitioning options above take effect
-
Outputs:
<bi_gram><tab><bi_gram_conditional_prob>
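A sketch of the two scripts under two assumptions that are not spelled out in the write-up: the mapper tells the two input sources apart by the number of words in the key, and the reducer relies on the secondary sort configured above to see each context's probability record before the n-grams that extend it:

#!/usr/bin/env python
# n_gram_prob_mapper.py <n> <n_gram_total_sum> -- hypothetical sketch
import sys

n = int(sys.argv[1])
total = float(sys.argv[2])

for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t')
    words = key.split()
    if len(words) == n:
        # n-gram count record: <context>\t<last word>\t<joint probability>
        print('%s\t%s\t%.10f' % (' '.join(words[:-1]), words[-1], int(value) / total))
    else:
        # (n-1)-gram probability record: the leading "-" sorts it ahead of
        # every n-gram record sharing the same context
        print('%s\t-%s' % (key, value))

#!/usr/bin/env python
# n_gram_prob_reducer.py -- hypothetical sketch
import sys

context, context_prob = None, 0.0
for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    if len(fields) == 2:
        # context record; undo the "-" prefix the mapper added
        context, context_prob = fields[0], -float(fields[1])
    elif fields[0] == context and context_prob > 0:
        # conditional probability = joint probability / context probability
        print('%s %s\t%.10f' % (fields[0], fields[1], float(fields[2]) / context_prob))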
-
Command:
hadoop fs -rm -r 2_grams_prob; \
mapred streaming \
  -D stream.num.map.output.key.fields=2 \
  -D mapred.text.key.partitioner.options="-k1,1" \
  -D mapred.text.key.comparator.options="-k1,2" \
  -file n_gram_prob_mapper.py \
  -file n_gram_prob_reducer.py \
  -input 2_grams/* \
  -input 1_grams_prob/* \
  -output 2_grams_prob \
  -mapper "python n_gram_prob_mapper.py 2 $(hadoop fs -cat 2_grams_sum/* | cut -f 2)" \
  -reducer "python n_gram_prob_reducer.py" \
  -partitioner "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner"
Calculate the Trigram Frequencies, Trigram Total Count, and Trigram Conditional Probabilities using the commands below. They use the same scripts and approach as the Bigram Frequencies, Bigram Total Count, and Bigram Conditional Probability steps.
-
Input: text data
-
Output: Trigram Frequencies, Trigram Total Count, Trigram Conditional Probabilities
-
Commands:
# Trigram Frequencies
hadoop fs -rm -r 3_grams; \
mapred streaming -file n_gram_count_mapper.py \
  -file n_gram_count_reducer.py \
  -input hw1.2/* \
  -output 3_grams \
  -mapper "python n_gram_count_mapper.py 3 3" \
  -reducer "python n_gram_count_reducer.py"

# Trigram Total Count
hadoop fs -rm -r 3_grams_sum; \
mapred streaming -file n_gram_count_mapper.py \
  -file n_gram_sum_reducer.py \
  -input 3_grams/* \
  -output 3_grams_sum \
  -mapper "cat" \
  -numReduceTasks 1 \
  -reducer "python n_gram_sum_reducer.py"

# Trigram Conditional Probabilities
hadoop fs -rm -r 3_grams_prob; \
mapred streaming \
  -D stream.num.map.output.key.fields=2 \
  -D mapred.text.key.partitioner.options="-k1,1" \
  -D mapred.text.key.comparator.options="-k1,2" \
  -file n_gram_prob_mapper.py \
  -file n_gram_prob_reducer.py \
  -input 3_grams/* \
  -input 2_grams_prob/* \
  -output 3_grams_prob \
  -mapper "python n_gram_prob_mapper.py 3 $(hadoop fs -cat 3_grams_sum/* | cut -f 2)" \
  -reducer "python n_gram_prob_reducer.py" \
  -partitioner "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner"
Sort the Trigram conditional probabilities by probability value.
-
Input: Trigram Conditional Probabilities
-
Output: Trigrams sorted by Conditional Probabilities
-
Mapper:
python n_gram_sort_mapper.py
Outputs: <word1><space><word2><tab><probability><tab><word3>
-
Reducer:
cat
-
Output:
<word1><space><word2><tab><probability><tab><word3>
Sorted in ascending order of probability.
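A minimal sketch of n_gram_sort_mapper.py, assuming it only rearranges the fields so the key comparator can order on the probability:

#!/usr/bin/env python
# n_gram_sort_mapper.py -- hypothetical sketch
import sys

for line in sys.stdin:
    ngram, prob = line.rstrip('\n').split('\t')
    words = ngram.split()
    # <w1 w2 w3>\t<prob>  ->  <w1 w2>\t<prob>\t<w3>
    print('%s\t%s\t%s' % (' '.join(words[:-1]), prob, words[-1]))
-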
Other command options:
- -D stream.num.map.output.key.fields=3
  Specify that the key is made up of all three tab-separated fields
- -D mapred.text.key.partitioner.options="-k1,1"
  Partition on the first field (<word1><space><word2>) only
- -D mapred.text.key.comparator.options="-k1,2"
  Sort reducer input on the first two fields, ordering trigrams by probability within each context
- -partitioner "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner"
  Use the key-field-based partitioner so the partitioning options above take effect
-
Command:
hadoop fs -rm -r 3_grams_sorted; \
mapred streaming \
  -D stream.num.map.output.key.fields=3 \
  -D mapred.text.key.partitioner.options="-k1,1" \
  -D mapred.text.key.comparator.options="-k1,2" \
  -file n_gram_sort_mapper.py \
  -input 3_grams_prob/* \
  -output 3_grams_sorted \
  -mapper "python n_gram_sort_mapper.py" \
  -reducer "cat" \
  -partitioner "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner"