umangshah-js/n_gram_conditional_probability_mapreduce

Part 1.1: DATAPROC USAGE

Create directory uks8451-bd23

  • Command:

    hadoop fs -mkdir uks8451-bd23
  • Screenshot:

  • Extract input files into the hw1.2 directory

  • Commands:

        unzip hw1.2.zip
        hadoop fs -mkdir hw1.2
        hadoop fs -put hw1.2
        hadoop fs -ls hw1.2
  • Screenshots:

Part 1.2: N_GRAM CONDITIONAL PROBABILITY

TL;DR: run bash commands.sh

Job 1: Unigram Frequencies

  • Input: text data

  • Mapper: python n_gram_count_mapper.py n min_words (a sketch of both scripts follows this job)
    n => 1 for unigrams
    min_words => 3 (minimum number of words a line must contain to be considered valid)
    Outputs: <uni_gram><tab>1

  • Reducer: python n_gram_count_reducer.py

  • Outputs: <uni_gram><tab><count>

  • Command:

    hadoop fs -rm -r 1_grams; \
    mapred streaming -file n_gram_count_mapper.py \
    -file n_gram_count_reducer.py \
    -input hw1.2/* \
    -output 1_grams \
    -mapper "python n_gram_count_mapper.py 1 3" \
    -reducer "python n_gram_count_reducer.py"
  • Screenshot
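
The mapper and reducer sources are not reproduced in this README. Minimal sketches of what n_gram_count_mapper.py and n_gram_count_reducer.py might look like are shown below, assuming whitespace tokenization and lowercasing (the real scripts may tokenize differently):

    #!/usr/bin/env python
    # n_gram_count_mapper.py <n> <min_words> (hypothetical sketch)
    import sys

    n = int(sys.argv[1])          # n-gram size: 1 = unigram, 2 = bigram, ...
    min_words = int(sys.argv[2])  # minimum words for a line to be valid

    for line in sys.stdin:
        words = line.strip().lower().split()
        if len(words) < min_words:
            continue                                   # skip invalid lines
        for i in range(len(words) - n + 1):
            print("%s\t1" % " ".join(words[i:i + n]))  # <n_gram><tab>1

    #!/usr/bin/env python
    # n_gram_count_reducer.py (hypothetical sketch): streaming sorts map
    # output by key, so equal n-grams arrive adjacent and can be summed
    # in a single pass.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        gram, value = line.rstrip("\n").split("\t")
        if gram != current:
            if current is not None:
                print("%s\t%d" % (current, count))     # <n_gram><tab><count>
            current, count = gram, 0
        count += int(value)
    if current is not None:
        print("%s\t%d" % (current, count))

For example, with n=2 and min_words=3, the line "the united states" would yield "the united" and "united states", each with a count of 1.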

Job 2: Unigram Total Count

  • Input: Unigram Frequencies
  • Mapper: cat
  • Combiner: python n_gram_sum_reducer.py
  • Reducer: python n_gram_sum_reducer.py (a sketch follows this job)
  • Outputs: sum<tab><total_count>
  • Command:
    hadoop fs -rm -r 1_grams_sum; \
    mapred streaming -file n_gram_count_mapper.py \
    -file n_gram_sum_reducer.py \
    -input 1_grams/* \
    -output 1_grams_sum \
    -mapper "cat" \
    -combiner "python n_gram_sum_reducer.py" \
    -reducer "python n_gram_sum_reducer.py"
  • Screenshot
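
n_gram_sum_reducer.py can serve as both combiner and reducer because summation is associative. A minimal sketch, assuming the count is always the last tab-separated field of each input line:

    #!/usr/bin/env python
    # n_gram_sum_reducer.py (hypothetical sketch): adds up the count
    # column of every input line and emits one total under the constant
    # key "sum".
    import sys

    total = 0
    for line in sys.stdin:
        # handles both "<n_gram>\t<count>" (combiner input) and
        # "sum\t<partial_total>" (reducer input)
        total += int(line.rstrip("\n").split("\t")[-1])
    print("sum\t%d" % total)

Since every record carries the same key "sum", all partial totals are routed to a single reducer, which emits the grand total.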

Job 3: Unigram Probability

  • Input: Unigram Frequencies

  • Mapper: python uni_gram_prob_mapper.py <unigram_total_sum> (a sketch follows this job)

    unigram_total_sum = $(hadoop fs -cat 1_grams_sum/* | cut -f 2)

  • Reducer: cat

  • Outputs: <uni_gram><tab><uni_gram_prob>

  • Command:

    hadoop fs -rm -r 1_grams_prob; \
    mapred streaming -file uni_gram_prob_mapper.py \
    -input 1_grams/* \
    -output 1_grams_prob \
    -mapper "python uni_gram_prob_mapper.py $(hadoop fs -cat 1_grams_sum\/*  | cut -f 2)" \
    -reducer "cat"
  • Screenshot
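
A minimal sketch of what uni_gram_prob_mapper.py might look like: it divides each unigram's count by the total passed as its only argument (the real script may format the probability differently):

    #!/usr/bin/env python
    # uni_gram_prob_mapper.py <unigram_total_sum> (hypothetical sketch)
    import sys

    total = float(sys.argv[1])  # total unigram count from 1_grams_sum
    for line in sys.stdin:
        gram, count = line.rstrip("\n").split("\t")
        # <uni_gram><tab><uni_gram_prob>
        print("%s\t%.12f" % (gram, int(count) / total))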

Job 4: Bigram Frequencies

  • Input: text data

  • Mapper: python n_gram_count_mapper.py n min_words
    n => 2 for bigrams
    min_words => 3 (minimum number of words a line must contain to be considered valid)
    Outputs: <bi_gram><tab>1

  • Reducer: python n_gram_count_reducer.py

  • Outputs: <bi_gram><tab><count>

  • Command:

    hadoop fs -rm -r 2_grams ; \
    mapred streaming -file n_gram_count_mapper.py \
    -file n_gram_count_reducer.py \
    -input hw1.2/* \
    -output 2_grams \
    -mapper "python n_gram_count_mapper.py 2 3" \
    -reducer "python n_gram_count_reducer.py"
  • Screenshot

Job 5: Bigram Total Count

  • Input: Bigram Frequencies
  • Mapper: cat
  • Combiner: python n_gram_sum_reducer.py
  • Reducer: python n_gram_sum_reducer.py
  • Outputs: sum<tab><total_count>
  • Command:
    hadoop fs -rm -r 2_grams_sum; \
    mapred streaming -file n_gram_count_mapper.py \
    -file n_gram_sum_reducer.py \
    -input 2_grams/* \
    -output 2_grams_sum \
    -mapper "cat" \
    -combiner "python n_gram_sum_reducer.py" \
    -reducer "python n_gram_sum_reducer.py"
  • Screenshot

(Final) Job 6: Bigram Conditional Probability

  • Input: Unigram Probabilities, Bigram Frequencies

  • Mapper: python n_gram_prob_mapper.py 2 <bigram_total_sum> (sketches of the mapper and reducer follow this job)
    bigram_total_sum = $(hadoop fs -cat 2_grams_sum/* | cut -f 2)
    Outputs, for every line from the bigram counts: <word_1><tab><word_2><tab><bi_gram_frequency/bigram_total_sum>
    and, for every line from the unigram probabilities: <unigram><tab>-<unigram_probability> (the "-" sentinel sorts before any word, so each unigram reaches the reducer before its bigrams)

  • Reducer: python n_gram_prob_reducer.py

  • Other command options:

    • --partitioner "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner"
      Partition map output on a subset of the key fields rather than on the whole key
    • -D stream.num.map.output.key.fields=2
      Treat the first two tab-separated fields of the map output as the key
    • -D mapred.text.key.partitioner.options="-k1,1"
      Partition on the first field only, ensuring all bigrams and their respective unigrams reach the same reducer
    • -D mapred.text.key.comparator.options="-k1,2"
      Sort reducer input on the first two fields, ensuring each unigram precedes its respective bigrams
  • Outputs: <bi_gram><tab><bi_gram_conditional_prob>

  • Command:

    hadoop fs -rm -r 2_grams_prob; \
    mapred streaming \
    -D stream.num.map.output.key.fields=2 \
    -D mapred.text.key.partitioner.options="-k1,1" \
    -D mapred.text.key.comparator.options="-k1,2" \
    -file n_gram_prob_mapper.py \
    -file n_gram_prob_reducer.py \
    -input 2_grams/* \
    -input 1_grams_prob/* \
    -output 2_grams_prob \
    -mapper "python n_gram_prob_mapper.py 2 $(hadoop fs -cat 2_grams_sum\/*  | cut -f 2)" \
    -reducer "python n_gram_prob_reducer.py" \
    --partitioner "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner"
  • Screenshot
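
Minimal sketches of what n_gram_prob_mapper.py and n_gram_prob_reducer.py might look like, reconstructed from the descriptions above (not the actual sources; in particular, the sketch emits the "-" sentinel as its own tab-separated field, which the real script may format slightly differently):

    #!/usr/bin/env python
    # n_gram_prob_mapper.py <n> <total> (hypothetical sketch)
    import sys

    n = int(sys.argv[1])        # size of the n-gram being conditioned
    total = float(sys.argv[2])  # total n-gram count, e.g. from 2_grams_sum

    for line in sys.stdin:
        gram, value = line.rstrip("\n").split("\t")
        words = gram.split(" ")
        if len(words) == n:
            # n-gram frequency line: emit its joint probability, keyed by
            # (context, last word)
            context = " ".join(words[:-1])
            print("%s\t%s\t%.12f" % (context, words[-1], int(value) / total))
        elif len(words) == n - 1:
            # (n-1)-gram probability line: the "-" sentinel sorts before
            # any word, so the reducer sees the context's probability first
            print("%s\t-\t%s" % (gram, value))

    #!/usr/bin/env python
    # n_gram_prob_reducer.py (hypothetical sketch): relies on the
    # partitioner/comparator options so that each context's probability
    # line arrives before its n-gram lines.
    import sys

    current_context = None  # the conditioning (n-1)-gram
    context_prob = None     # its probability, read from the sentinel line

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        context, second = fields[0], fields[1]
        if second == "-":
            current_context, context_prob = context, float(fields[2])
        elif context == current_context and context_prob:
            # conditional probability = joint probability / context probability
            print("%s %s\t%.12f" % (context, second, float(fields[2]) / context_prob))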

Part 1.3 Bonus: Most Probable Word after "United States"

Jobs 1, 2, 3: Calculate Trigram Frequencies, Totals and Probabilities

Calculate the trigram frequencies, trigram total counts and trigram conditional probabilities using the commands below. They use the same files and approach as the bigram frequency, bigram total count and bigram conditional probability jobs above, with n = 3 and the bigram probabilities as the second input.
  • Input: text data

  • Output: Trigram Frequencies

  • Commands:

    # Trigram Frequencies
    hadoop fs -rm -r 3_grams ; \
    mapred streaming -file n_gram_count_mapper.py \
    -file n_gram_count_reducer.py \
    -input hw1.2/* \
    -output 3_grams \
    -mapper "python n_gram_count_mapper.py 3 3" \
    -reducer "python n_gram_count_reducer.py";
    
    # Trigram Total Count
    hadoop fs -rm -r 3_grams_sum; \
    mapred streaming -file n_gram_count_mapper.py \
    -file n_gram_sum_reducer.py \
    -input 3_grams/* \
    -output 3_grams_sum \
    -mapper "cat" \
    -numReduceTasks 1 \
    -reducer "python n_gram_sum_reducer.py " ;
    
    # Trigram Conditional Probabilities
    hadoop fs -rm -r 3_grams_prob; \
    mapred streaming \
    -D stream.num.map.output.key.fields=2 \
    -D mapred.text.key.partitioner.options="-k1,1" \
    -D mapred.text.key.comparator.options="-k1,2" \
    -file n_gram_prob_mapper.py \
    -file n_gram_prob_reducer.py \
    -input 3_grams/* \
    -input 2_grams_prob/* \
    -output 3_grams_prob \
    -mapper "python n_gram_prob_mapper.py 3 $(hadoop fs -cat 3_grams_sum\/*  | cut -f 2)" \
    -reducer "python n_gram_prob_reducer.py" \
    --partitioner "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner";
  • Screenshot



Job 4: Sort Trigrams by Conditional Probability

Sort the trigram conditional probabilities by probability value.
  • Input: Trigram Conditional Probabilities

  • Output: Trigrams sorted by Conditional Probabilities

  • Mapper: python n_gram_sort_mapper.py (a sketch follows this job)
    Outputs: <word1><space><word2><tab><probability><tab><word3>

  • Reducer: cat

  • Output: <word1><space><word2><tab><probability><tab><word3>, sorted in ascending order of probability within each <word1><space><word2> context

  • Other command options:

    • --partitioner "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner"
      Partition map output on a subset of the key fields rather than on the whole key
    • -D stream.num.map.output.key.fields=3
      Treat all three tab-separated fields of the map output as the key
    • -D mapred.text.key.partitioner.options="-k1,1"
      Partition on the first field only, ensuring all trigrams that share a leading bigram reach the same reducer
    • -D mapred.text.key.comparator.options="-k1,2"
      Sort reducer input on the first two fields, i.e. by leading bigram and then by ascending probability
  • Command:

    hadoop fs -rm -r 3_grams_sorted; mapred streaming \
    -D stream.num.map.output.key.fields=3 \
    -D mapred.text.key.partitioner.options="-k1,1" \
    -D mapred.text.key.comparator.options="-k1,2" \
    -file n_gram_sort_mapper.py \
    -input 3_grams_prob/* \
    -mapper "python n_gram_sort_mapper.py" \
    -output 3_grams_sorted \
    -reducer "cat" \
    --partitioner "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner"
  • Screenshot
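
A minimal sketch of what n_gram_sort_mapper.py might look like: it moves the trigram's last word behind the probability so that the first two tab-separated fields become (leading bigram, probability), which is exactly what the comparator sorts on (the actual script may differ):

    #!/usr/bin/env python
    # n_gram_sort_mapper.py (hypothetical sketch)
    import sys

    for line in sys.stdin:
        gram, prob = line.rstrip("\n").split("\t")
        words = gram.split(" ")
        # <word1><space><word2><tab><probability><tab><word3>
        print("%s\t%s\t%s" % (" ".join(words[:-1]), prob, words[-1]))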

Final Output

grep keeps only the trigrams whose context is "united states"; because the sort job orders each context's trigrams by ascending probability, tail -n 1 returns the most probable next word.

  • Command:
    hadoop fs -cat 3_grams_sorted/* | grep -P "^united states\t" | tail -n 1;
  • Screenshot:

About

Map Reduce + Python Implementation to calculate n-gram Conditional Probability
