Joao-M-Almeida/JEECUnbabelChallenge

JEECUnbabelChallenge

Participation in the JEEC Unbabel Challenge.

At JEEC16 (an Electrical and Computer Engineering event organized by the students at Técnico), Unbabel (a Portuguese startup that does human-powered machine translation) gave a talk and proposed a challenge to the students attending. We (João Almeida and Ricardo Martins) decided to participate. This repository contains all the work we did during the 11-day Challenge, from 11/03/2016 to 21/03/2016.

####Goal: Create a ML classifier to distinguish between machine translations and human translations.

####Data Used: The available dataset consisted of about 20K labeled phrases. The final test set consisted of 3220 phrases to classify.

####Scripts:

  • FeatureGen: Script to calculate handmade features, created with the intention of finding translator errors;
  • ClassifiersTest: Script to test different classifiers on the training data and then fine-tune their parameters. Created from the template at DataScienceTools;
  • ClassifyTestSet: Script to train the final classifier and generate the predictions for the test set;
  • ProcessTextIntoSamples: Script to transform external text into usable text samples;
  • PostTag: Set of functions to convert phrases into an n-gram POS-tagged representation of the text;
  • FirstTests: Mix of testing and POS tagging; TODO: clean up;
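
The n-gram POS representation mentioned for PostTag can be illustrated with a minimal sketch. It assumes the phrase has already been tagged by some POS tagger. The function name `pos_ngrams`, the tag set, and the example sentence are hypothetical and not taken from the repository's scripts.

```python
from collections import Counter


def pos_ngrams(tagged, n=2):
    """Return the n-grams of POS tags for one tagged phrase.

    `tagged` is a list of (word, tag) pairs, as a POS tagger would
    typically produce. Only the tags are kept, so the result describes
    the grammatical shape of the phrase rather than its vocabulary.
    """
    tags = [tag for _word, tag in tagged]
    # Slide a window of size n over the tag sequence.
    return ["_".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]


# Hypothetical, hand-tagged example phrase (tags are illustrative).
tagged = [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"), ("down", "ADV")]

bigrams = pos_ngrams(tagged, n=2)
# Counting the n-grams gives a bag-of-n-grams feature dictionary.
features = Counter(bigrams)
print(bigrams)
```

Counting tag n-grams instead of word n-grams keeps the feature space small, which matters with only about 20K labeled phrases.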

####Approach:

  • TODO: Explain approach.

####Final Results: A voting classifier based on LDA, LogisticRegression and AdaBoost, with dimensionality reduction by principal component analysis, achieved a cross-validation score of (55.0 +/- 1.6)% and a final result of 57.8%.
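
A rough sketch of how such an ensemble could be assembled is shown below. It assumes scikit-learn, takes "LDA" to mean Linear Discriminant Analysis, and uses hard (majority) voting. The synthetic data, the number of principal components, and all hyperparameters are placeholders, not the settings used for the submitted classifier.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the real feature matrix (assumption: the
# challenge data is not reproduced here).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Majority vote over the three classifier families named above.
voter = VotingClassifier(
    estimators=[
        ("lda", LinearDiscriminantAnalysis()),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("ada", AdaBoostClassifier(n_estimators=50, random_state=0)),
    ],
    voting="hard",
)

# PCA is fitted inside the pipeline, so each cross-validation fold
# learns its projection from its own training split only.
model = make_pipeline(PCA(n_components=10), voter)

scores = cross_val_score(model, X, y, cv=5)
print("accuracy: (%.1f +/- %.1f)%%" % (100 * scores.mean(), 100 * scores.std()))
```

Reporting the mean and standard deviation of the fold scores gives a figure in the same "(mean +/- spread)%" form as the cross-validation score quoted above.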

#####Global Results:

  1. Francisco Dias: 59.72%
  2. Catarina Silva: 59.22%
  3. Miguel Borges Ribeiro, Tiago Baltazar: 58.39%
  4. João Almeida, Ricardo Martins: 57.80%
  5. João Rocha e Melo, Miguel Monteiro: 56.34%
  6. António Lopes: 55.62%
  7. Tiago Santos, Nuno Xu: 51.77%
  8. Bruno Henriques, Joana Lapas: 50.25%
  9. Sandro Nunes: 47.86%
  10. Gonçalo Correia: 47.61%
  11. Ricardo Amendoeira: 45.56%
  12. Jorge Matos: 40.43%
  13. Luis Novoa, Maria Carvalho: 0.00% (file missing results)

#####Packages required:
