JEECUnbabelChallenge

Participation in the JEEC Unbabel Challenge.

At JEEC16 (an Electrical and Computer Engineering event organized by the students at Técnico), Unbabel (a Portuguese startup that does human-powered machine translation) gave a talk and proposed a challenge to the attending students. We (João Almeida and Ricardo Martins) decided to participate. This repository contains all the work we did during the 11-day challenge, from 11/03/2016 to 21/03/2016.

####Goal: Create a ML classifier to distinguish between machine translations and human translations.

####Data Used: The available dataset consisted of about 20K labeled phrases, plus a final test set of 3220 phrases to classify.

####Scripts:

FeatureGen: Script to calculate handmade features designed to detect translator errors;
ClassifiersTest: Script to test different classifiers on the training data and then fine-tune their parameters. Created from the template at DataScienceTools;
ClassifyTestSet: Script to train the final classifier and generate the predictions for the test set;
ProcessTextIntoSamples: Script to transform external text into usable text samples;
PostTag: Set of functions to convert phrases into an n-gram POS-tagged representation of the text;
FirstTests: Mix of testing and POS tagging; TODO: clean up;
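
To illustrate what an n-gram POS-tagged representation looks like, here is a minimal sketch. It is not the code in PostTag: the function name `pos_ngrams` is hypothetical, and the tags in the example are written by hand in Penn Treebank style, on the assumption that a real tagger would supply them.

```python
def pos_ngrams(tagged_tokens, n=2):
    """Return the POS-tag n-grams of a tagged phrase.

    tagged_tokens: list of (token, tag) pairs, assumed to come from a POS tagger.
    n: size of the n-gram window.
    """
    # Keep only the tags; the words themselves are discarded.
    tags = [tag for _token, tag in tagged_tokens]
    # Slide a window of size n over the tag sequence and join each window.
    return ["_".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]


# Hand-tagged example phrase (illustrative only).
tagged = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"), ("down", "RP")]
bigrams = pos_ngrams(tagged, n=2)
print(bigrams)  # ['DT_NN', 'NN_VBD', 'VBD_RP']
```

Each n-gram string can then be counted like a word in a bag-of-words model, so a classifier sees grammatical patterns rather than vocabulary.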

####Approach:

TODO: Explain approach.

####Final Results: A voting classifier combining LDA, LogisticRegression and AdaBoost, with dimensionality reduction by principal component analysis, achieved a cross-validation score of (55.0 +/- 1.6)% and a final result of 57.8%.
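
A pipeline of this shape can be sketched with scikit-learn as follows. This is an illustration under assumptions, not the code in ClassifyTestSet: the helper name `build_voting_pipeline`, the number of PCA components, the hard-voting choice, all hyperparameters, and the synthetic data are placeholders, so the printed score has no relation to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def build_voting_pipeline(n_components=10):
    """PCA followed by a majority vote of LDA, logistic regression and AdaBoost."""
    voter = VotingClassifier(
        estimators=[
            ("lda", LinearDiscriminantAnalysis()),
            ("logreg", LogisticRegression(max_iter=1000)),
            ("ada", AdaBoostClassifier(n_estimators=50, random_state=0)),
        ],
        voting="hard",  # assumption: majority vote on predicted labels
    )
    return make_pipeline(PCA(n_components=n_components), voter)


# Synthetic stand-in for the feature matrix; the real features are not reproduced here.
X, y = make_classification(
    n_samples=400, n_features=30, n_informative=8, random_state=0
)

# 5-fold cross-validation, reported as mean +/- standard deviation.
scores = cross_val_score(build_voting_pipeline(), X, y, cv=5)
print("CV accuracy: (%.1f +/- %.1f)%%" % (100 * scores.mean(), 100 * scores.std()))
```

Placing PCA inside the pipeline means it is refit on each training fold, so the cross-validation score is not inflated by information from the held-out fold.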

#####Global Results:

Francisco Dias: 59.72%
Catarina Silva: 59.22%
Miguel Borges Ribeiro, Tiago Baltazar: 58.39%
João Almeida, Ricardo Martins: 57.80%
João Rocha e Melo, Miguel Monteiro: 56.34%
António Lopes: 55.62%
Tiago Santos, Nuno Xu: 51.77%
Bruno Henriques, Joana Lapas: 50.25%
Sandro Nunes: 47.86%
Gonçalo Correia: 47.61%
Ricardo Amendoeira: 45.56%
Jorge Matos: 40.43%
Luis Novoa, Maria Carvalho: 0.00% (file missing results)

#####Packages required:
