umangshah-js/n_gram_conditional_probability_mapreduce

Part 1.1: DATAPROC USAGE

Create directory uks8451-bd23

  • Command:

    hadoop fs -mkdir uks8451-bd23
  • Screenshot:

  • Extract input files into the hw1.2 directory

  • Commands:

        unzip hw1.2.zip
        hadoop fs -mkdir hw1.2
        hadoop fs -put hw1.2
        hadoop fs -ls hw1.2
  • Screenshots:

Part 1.2: N_GRAM CONDITIONAL PROBABILITY

TL;DR: run bash commands.sh

Job 1: Unigram Frequencies

  • Input: text data

  • Mapper: python n_gram_count_mapper.py n min_words (a sketch of both scripts follows this job)
    n => 1 for unigrams
    min_words => 3 (minimum number of words a line must contain to be considered valid)
    Outputs: <uni_gram><tab>1

  • Reducer: python n_gram_count_reducer.py

  • Outputs: <uni_gram><tab><count>

  • Command:

    hadoop fs -rm -r 1_grams; \
    mapred streaming -file n_gram_count_mapper.py \
    -file n_gram_count_reducer.py \
    -input hw1.2/* \
    -output 1_grams \
    -mapper "python n_gram_count_mapper.py 1 3" \
    -reducer "python n_gram_count_reducer.py"
  • Screenshot
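
The mapper and reducer sources are not reproduced in this README. Minimal sketches of what n_gram_count_mapper.py and n_gram_count_reducer.py might look like are shown below, assuming whitespace tokenization and lowercasing (the real scripts may tokenize differently):

    #!/usr/bin/env python
    # n_gram_count_mapper.py <n> <min_words> (hypothetical sketch)
    import sys

    n = int(sys.argv[1])          # n-gram size: 1 = unigram, 2 = bigram, ...
    min_words = int(sys.argv[2])  # minimum words for a line to be valid

    for line in sys.stdin:
        words = line.strip().lower().split()
        if len(words) < min_words:
            continue                                   # skip invalid lines
        for i in range(len(words) - n + 1):
            print("%s\t1" % " ".join(words[i:i + n]))  # <n_gram><tab>1

    #!/usr/bin/env python
    # n_gram_count_reducer.py (hypothetical sketch): streaming sorts map
    # output by key, so equal n-grams arrive adjacent and can be summed
    # in a single pass.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        gram, value = line.rstrip("\n").split("\t")
        if gram != current:
            if current is not None:
                print("%s\t%d" % (current, count))     # <n_gram><tab><count>
            current, count = gram, 0
        count += int(value)
    if current is not None:
        print("%s\t%d" % (current, count))

For example, with n=2 and min_words=3, the line "the united states" would yield "the united" and "united states", each with a count of 1.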

Job 2: Unigram Total Count

  • Input: Unigram Frequencies
  • Mapper: cat
  • Combiner: python n_gram_sum_reducer.py
  • Reducer: python n_gram_sum_reducer.py (a sketch follows this job)
  • Outputs: sum<tab><total_count>
  • Command:
    hadoop fs -rm -r 1_grams_sum; \
    mapred streaming -file n_gram_count_mapper.py \
    -file n_gram_sum_reducer.py \
    -input 1_grams/* \
    -output 1_grams_sum \
    -mapper "cat" \
    -combiner "python n_gram_sum_reducer.py" \
    -reducer "python n_gram_sum_reducer.py"
  • Screenshot
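
n_gram_sum_reducer.py can serve as both combiner and reducer because summation is associative. A minimal sketch, assuming the count is always the last tab-separated field of each input line:

    #!/usr/bin/env python
    # n_gram_sum_reducer.py (hypothetical sketch): adds up the count
    # column of every input line and emits one total under the constant
    # key "sum".
    import sys

    total = 0
    for line in sys.stdin:
        # handles both "<n_gram>\t<count>" (combiner input) and
        # "sum\t<partial_total>" (reducer input)
        total += int(line.rstrip("\n").split("\t")[-1])
    print("sum\t%d" % total)

Since every record carries the same key "sum", all partial totals are routed to a single reducer, which emits the grand total.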

Job 3: Unigram Probability

  • Input: Unigram Frequencies

  • Mapper: python uni_gram_prob_mapper.py <unigram_total_sum> (a sketch follows this job)

    unigram_total_sum = $(hadoop fs -cat 1_grams_sum/* | cut -f 2)

  • Reducer: cat

  • Outputs: <uni_gram><tab><uni_gram_prob>

  • Command:

    hadoop fs -rm -r 1_grams_prob; \
    mapred streaming -file uni_gram_prob_mapper.py \
    -input 1_grams/* \
    -output 1_grams_prob \
    -mapper "python uni_gram_prob_mapper.py $(hadoop fs -cat 1_grams_sum\/*  | cut -f 2)" \
    -reducer "cat"
  • Screenshot
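
A minimal sketch of what uni_gram_prob_mapper.py might look like: it divides each unigram's count by the total passed as its only argument (the real script may format the probability differently):

    #!/usr/bin/env python
    # uni_gram_prob_mapper.py <unigram_total_sum> (hypothetical sketch)
    import sys

    total = float(sys.argv[1])  # total unigram count from 1_grams_sum
    for line in sys.stdin:
        gram, count = line.rstrip("\n").split("\t")
        # <uni_gram><tab><uni_gram_prob>
        print("%s\t%.12f" % (gram, int(count) / total))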

Job 4: Bigram Frequencies

  • Input: text data

  • Mapper: python n_gram_count_mapper.py n min_words
    n => 2 for bigrams
    min_words => 3 (minimum number of words a line must contain to be considered valid)
    Outputs: <bi_gram><tab>1

  • Reducer: python n_gram_count_reducer.py

  • Outputs: <bi_gram><tab><count>

  • Command:

    hadoop fs -rm -r 2_grams ; \
    mapred streaming -file n_gram_count_mapper.py \
    -file n_gram_count_reducer.py \
    -input hw1.2/* \
    -output 2_grams \
    -mapper "python n_gram_count_mapper.py 2 3" \
    -reducer "python n_gram_count_reducer.py"
  • Screenshot

Job 5: Bigram Total Count

  • Input: Bigram Frequencies
  • Mapper: cat
  • Combiner: python n_gram_sum_reducer.py
  • Reducer: python n_gram_sum_reducer.py
  • Outputs: sum<tab><total_count>
  • Command:
    hadoop fs -rm -r 2_grams_sum; \
    mapred streaming -file n_gram_count_mapper.py \
    -file n_gram_sum_reducer.py \
    -input 2_grams/* \
    -output 2_grams_sum \
    -mapper "cat" \
    -combiner "python n_gram_sum_reducer.py" \
    -reducer "python n_gram_sum_reducer.py"
  • Screenshot

(Final) Job 6: Bigram Conditional Probability

  • Input: Unigram Probabilities, Bigram Frequencies

  • Mapper: python n_gram_prob_mapper.py 2 <bigram_total_sum> (sketches of the mapper and reducer follow this job)
    bigram_total_sum = $(hadoop fs -cat 2_grams_sum/* | cut -f 2)
    Outputs, for every line from the bigram counts: <word_1><tab><word_2><tab><bi_gram_frequency/bigram_total_sum>
    and, for every line from the unigram probabilities: <unigram><tab>-<unigram_probability> (the "-" sentinel sorts before any word, so each unigram reaches the reducer before its bigrams)

  • Reducer: python n_gram_prob_reducer.py

  • Other command options:

    • --partitioner "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner"
      Partition map output on a subset of the key fields rather than on the whole key
    • -D stream.num.map.output.key.fields=2
      Treat the first two tab-separated fields of the map output as the key
    • -D mapred.text.key.partitioner.options="-k1,1"
      Partition on the first field only, ensuring all bigrams and their respective unigrams reach the same reducer
    • -D mapred.text.key.comparator.options="-k1,2"
      Sort reducer input on the first two fields, ensuring each unigram precedes its respective bigrams
  • Outputs: <bi_gram><tab><bi_gram_conditional_prob>

  • Command:

    hadoop fs -rm -r 2_grams_prob; \
    mapred streaming \
    -D stream.num.map.output.key.fields=2 \
    -D mapred.text.key.partitioner.options="-k1,1" \
    -D mapred.text.key.comparator.options="-k1,2" \
    -file n_gram_prob_mapper.py \
    -file n_gram_prob_reducer.py \
    -input 2_grams/* \
    -input 1_grams_prob/* \
    -output 2_grams_prob \
    -mapper "python n_gram_prob_mapper.py 2 $(hadoop fs -cat 2_grams_sum\/*  | cut -f 2)" \
    -reducer "python n_gram_prob_reducer.py" \
    --partitioner "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner"
  • Screenshot
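
Minimal sketches of what n_gram_prob_mapper.py and n_gram_prob_reducer.py might look like, reconstructed from the descriptions above (not the actual sources; in particular, the sketch emits the "-" sentinel as its own tab-separated field, which the real script may format slightly differently):

    #!/usr/bin/env python
    # n_gram_prob_mapper.py <n> <total> (hypothetical sketch)
    import sys

    n = int(sys.argv[1])        # size of the n-gram being conditioned
    total = float(sys.argv[2])  # total n-gram count, e.g. from 2_grams_sum

    for line in sys.stdin:
        gram, value = line.rstrip("\n").split("\t")
        words = gram.split(" ")
        if len(words) == n:
            # n-gram frequency line: emit its joint probability, keyed by
            # (context, last word)
            context = " ".join(words[:-1])
            print("%s\t%s\t%.12f" % (context, words[-1], int(value) / total))
        elif len(words) == n - 1:
            # (n-1)-gram probability line: the "-" sentinel sorts before
            # any word, so the reducer sees the context's probability first
            print("%s\t-\t%s" % (gram, value))

    #!/usr/bin/env python
    # n_gram_prob_reducer.py (hypothetical sketch): relies on the
    # partitioner/comparator options so that each context's probability
    # line arrives before its n-gram lines.
    import sys

    current_context = None  # the conditioning (n-1)-gram
    context_prob = None     # its probability, read from the sentinel line

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        context, second = fields[0], fields[1]
        if second == "-":
            current_context, context_prob = context, float(fields[2])
        elif context == current_context and context_prob:
            # conditional probability = joint probability / context probability
            print("%s %s\t%.12f" % (context, second, float(fields[2]) / context_prob))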

Part 1.3 Bonus: Most Probable Word after "United States"

Jobs 1, 2, 3: Calculate Trigram Frequencies, Totals and Probabilities

Calculate the trigram frequencies, trigram total counts and trigram conditional probabilities using the commands below. They use the same files and approach as the bigram frequency, bigram total count and bigram conditional probability jobs above, with n = 3 and the bigram probabilities as the second input.
  • Input: text data

  • Output: Trigram Frequencies

  • Commands:

    # Trigram Frequencies
    hadoop fs -rm -r 3_grams ; \
    mapred streaming -file n_gram_count_mapper.py \
    -file n_gram_count_reducer.py \
    -input hw1.2/* \
    -output 3_grams \
    -mapper "python n_gram_count_mapper.py 3 3" \
    -reducer "python n_gram_count_reducer.py";
    
    # Trigram Total Count
    hadoop fs -rm -r 3_grams_sum; \
    mapred streaming -file n_gram_count_mapper.py \
    -file n_gram_sum_reducer.py \
    -input 3_grams/* \
    -output 3_grams_sum \
    -mapper "cat" \
    -numReduceTasks 1 \
    -reducer "python n_gram_sum_reducer.py " ;
    
    # Trigram Conditional Probabilities
    hadoop fs -rm -r 3_grams_prob; \
    mapred streaming \
    -D stream.num.map.output.key.fields=2 \
    -D mapred.text.key.partitioner.options="-k1,1" \
    -D mapred.text.key.comparator.options="-k1,2" \
    -file n_gram_prob_mapper.py \
    -file n_gram_prob_reducer.py \
    -input 3_grams/* \
    -input 2_grams_prob/* \
    -output 3_grams_prob \
    -mapper "python n_gram_prob_mapper.py 3 $(hadoop fs -cat 3_grams_sum\/*  | cut -f 2)" \
    -reducer "python n_gram_prob_reducer.py" \
    --partitioner "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner";
  • Screenshot



Job 4: Sort Trigrams by Conditional Probability

Sort the trigram conditional probabilities by probability value.
  • Input: Trigram Conditional Probabilities

  • Output: Trigrams sorted by Conditional Probabilities

  • Mapper: python n_gram_sort_mapper.py (a sketch follows this job)
    Outputs: <word1><space><word2><tab><probability><tab><word3>

  • Reducer: cat

  • Output: <word1><space><word2><tab><probability><tab><word3>, sorted in ascending order of probability within each <word1><space><word2> context

  • Other command options:

    • --partitioner "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner"
      Partition map output on a subset of the key fields rather than on the whole key
    • -D stream.num.map.output.key.fields=3
      Treat all three tab-separated fields of the map output as the key
    • -D mapred.text.key.partitioner.options="-k1,1"
      Partition on the first field only, ensuring all trigrams that share a leading bigram reach the same reducer
    • -D mapred.text.key.comparator.options="-k1,2"
      Sort reducer input on the first two fields, i.e. by leading bigram and then by ascending probability
  • Command:

    hadoop fs -rm -r 3_grams_sorted; mapred streaming \
    -D stream.num.map.output.key.fields=3 \
    -D mapred.text.key.partitioner.options="-k1,1" \
    -D mapred.text.key.comparator.options="-k1,2" \
    -file n_gram_sort_mapper.py \
    -input 3_grams_prob/* \
    -mapper "python n_gram_sort_mapper.py" \
    -output 3_grams_sorted \
    -reducer "cat" \
    --partitioner "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner"
  • Screenshot
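
A minimal sketch of what n_gram_sort_mapper.py might look like: it moves the trigram's last word behind the probability so that the first two tab-separated fields become (leading bigram, probability), which is exactly what the comparator sorts on (the actual script may differ):

    #!/usr/bin/env python
    # n_gram_sort_mapper.py (hypothetical sketch)
    import sys

    for line in sys.stdin:
        gram, prob = line.rstrip("\n").split("\t")
        words = gram.split(" ")
        # <word1><space><word2><tab><probability><tab><word3>
        print("%s\t%s\t%s" % (" ".join(words[:-1]), prob, words[-1]))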

Final Output

grep keeps only the trigrams whose context is "united states"; because the sort job orders each context's trigrams by ascending probability, tail -n 1 returns the most probable next word.

  • Command:
    hadoop fs -cat 3_grams_sorted/* | grep -P "^united states\t" | tail -n 1;
  • Screenshot:

About

Map Reduce + Python Implementation to calculate n-gram Conditional Probability
