
Aug 2021 - Venmurasu Programming Contest

Stage-1: Building Parallel Dataset

  • We looped through the folder containing the 20 paragraphs, stored as text files in both English and Tamil.
  • Each text file was read and split on full stops, and a regex was used to remove items in brackets.
  • The cleaned sentences were written out one per line, separated by newlines.
  • Consecutive paragraphs are separated by an empty line.
  • English paragraphs were written to the English text file and Tamil paragraphs to the Tamil text file.
  • The code we used for this is data_preprocessing.py under the data folder; a sketch of these steps follows this list.
  • We then manually corrected one-to-many and many-to-one sentence alignments.
  • The aligned Tamil text is in tam_equalized.txt, with its corresponding English lines in eng_equalized.txt, under the data folder.
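The actual script is data/data_preprocessing.py; the snippet below is only a minimal sketch of the steps described above, and the folder names, output paths, and bracket-stripping regex are illustrative assumptions rather than the script itself.

```python
# Minimal sketch of the preprocessing steps described above.
# The real implementation is data/data_preprocessing.py; paths and the
# regex here are assumptions for illustration.
import os
import re

def preprocess(input_dir, output_file):
    """Split each paragraph file into sentences, drop bracketed text,
    and write one sentence per line with a blank line between paragraphs."""
    with open(output_file, "w", encoding="utf-8") as out:
        for name in sorted(os.listdir(input_dir)):
            with open(os.path.join(input_dir, name), encoding="utf-8") as f:
                text = f.read()
            # remove items in round or square brackets
            text = re.sub(r"\([^)]*\)|\[[^\]]*\]", "", text)
            # split the paragraph on full stops into individual sentences
            sentences = [s.strip() for s in text.split(".") if s.strip()]
            out.write("\n".join(sentences) + "\n\n")  # blank line between paragraphs

# Hypothetical usage; the input folder names are assumptions.
preprocess("english_paragraphs", "data/eng.txt")
preprocess("tamil_paragraphs", "data/tam.txt")
```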

Stage-2: Running a Neural Machine Translation Model

Stage-3: Computing the scores of translation

  • For the BLEU score we used the NLTK library, passing the machine-translated English sentences (translated.txt) and the reference English sentences (eng_equalized.txt).
  • The code we used for this is bleu_score.py under the evaluation folder; a sketch of the computation follows this list.
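The per-paragraph scores reported below suggest one BLEU value per paragraph; the sketch here is a minimal version of that computation using NLTK, assuming paragraphs are separated by blank lines with one sentence per line, and assuming a smoothing function (whether evaluation/bleu_score.py uses smoothing is not stated).

```python
# Minimal sketch of per-paragraph BLEU scoring with NLTK.
# Assumptions: blank-line-separated paragraphs, one sentence per line,
# whitespace tokenization, and method1 smoothing.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def read_paragraphs(path):
    """Return a list of paragraphs, each a list of whitespace-tokenized sentences."""
    with open(path, encoding="utf-8") as f:
        blocks = f.read().strip().split("\n\n")
    return [[line.split() for line in block.splitlines() if line.strip()]
            for block in blocks]

references = read_paragraphs("data/eng_equalized.txt")      # human English
hypotheses = read_paragraphs("translation/translated.txt")  # model output

smooth = SmoothingFunction().method1
for i, (ref_par, hyp_par) in enumerate(zip(references, hypotheses), start=1):
    # corpus_bleu expects, for each hypothesis sentence, a list of reference sentences
    refs = [[ref] for ref in ref_par]
    score = corpus_bleu(refs, hyp_par, smoothing_function=smooth)
    print(f"BLEU score of para {i} is {score}")
```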

Stage-4: (Bonus points) Compare the model with other services

  • Using the translators library (https://github.com/UlionTse/translators), we translated the Tamil sentences (tam_equalized.txt); a sketch of this step follows this list.
  • The translated sentences are in ulison translate.txt under the translation folder.
  • We then calculated the BLEU scores with the same NLTK library, passing the translated sentences (ulison translate.txt) and the reference English sentences (eng_equalized.txt).
  • The code we used for this is again bleu_score.py under the evaluation folder.
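A minimal sketch of the translation step is shown below. It assumes a recent release of the translators package, where translate_text is the entry point (older 2021 releases exposed per-engine functions such as ts.google instead), and it assumes the Google backend; which backend was actually used is not stated above.

```python
# Minimal sketch of translating the Tamil file with the translators package.
# Assumptions: recent translators API (translate_text) and the Google backend.
import translators as ts

with open("data/tam_equalized.txt", encoding="utf-8") as f_in, \
     open("translation/ulison translate.txt", "w", encoding="utf-8") as f_out:
    for line in f_in:
        line = line.strip()
        if not line:
            f_out.write("\n")  # preserve the blank line between paragraphs
            continue
        translated = ts.translate_text(line, translator="google",
                                       from_language="ta", to_language="en")
        f_out.write(translated + "\n")
```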

BLEU score computation for sentences translated using the AI4BHARAT model

  • BLEU score of para 1 is 0.003816793893129772
  • BLEU score of para 2 is 0.019354838709677424
  • BLEU score of para 3 is 0.006864988558352401
  • BLEU score of para 4 is 0.023758099352051837
  • BLEU score of para 5 is 0.006993006993006996
  • BLEU score of para 6 is 0.027848101265822784
  • BLEU score of para 7 is 0.03237410071942446
  • BLEU score of para 8 is 0.023992322456813823
  • BLEU score of para 9 is 0.01955555555555556
  • BLEU score of para 10 is 0.023578363384188623

BLEU score computation for sentences translated using the UlionTse translators library

  • BLEU score of para 1 is 0.011363636363636359
  • BLEU score of para 2 is 0.005181347150259067
  • BLEU score of para 3 is 0.021739130434782608
  • BLEU score of para 4 is 0.002881844380403458
  • BLEU score of para 5 is 0.048701298701298704
  • BLEU score of para 6 is 0.017751479289940832
  • BLEU score of para 7 is 0.02727272727272727
  • BLEU score of para 8 is 0.021857923497267763
  • BLEU score of para 9 is 0.015217391304347823
  • BLEU score of para 10 is 0.012882447665056364

BLEU score comparison

[Figure: score_comparision – BLEU score comparison plot]

About

Translating Tamil data to English.
