
kaggle submission of the team 'poweredByTalkwalker' for the Allen AI competition

This code reproduces our solution, which scored 0.5900 on the public leaderboard and 0.58344 on the private one (submission of Friday, 12 February 2016, 07:28:58 UTC). We worked on an i7 quad-core machine with 32 GB RAM and a 400 GB SSD disk running Ubuntu 14.04.3 LTS.

model execution

This part only depends on R (3.2.2) with the packages plyr (1.8.3), dplyr (0.4.3), reshape2 (1.4.1), caret (6.0-57) and xgboost (0.4-2).
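
A quick way to install those packages from CRAN is the one-liner below; note that it pulls the current CRAN releases rather than the exact versions listed above, so pin the versions manually if you need an identical setup.

Rscript -e 'install.packages(c("plyr", "dplyr", "reshape2", "caret", "xgboost"), repos = "https://cloud.r-project.org")'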

In order to reproduce our submission, place the training_set.tsv, validation_set.tsv and test_set.tsv files into the R/input folder. Then execute

./scripts/unsplitRData.sh

followed by

cd R && mkdir output && R --no-save < createInputFile.R && R --no-save < runModel.R

This will produce the final submission in the folder R/output.
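
To confirm that the run finished, simply list that folder; the submission file name is whatever runModel.R writes there.

ls -lh R/output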

full model prediction pipeline

To allow exploring the model prediction in detail, we also provide the code to build the IR indices from scratch. Be aware that this part downloads more than 25 GB of data, requires a running ElasticSearch installation with around 100 GB of disk space, and that the combined runtime of all scripts is around 5 days.

As additional system requirements, this part needs

  • Java 7 for most steps (Java 8 for the NVAO transformation)
  • ElasticSearch 1.7.3
  • gradle 2.11
  • python 2.7 + NLTK(3.1)

and R packages

  • data.table(1.9.6)
  • elastic(0.5.0)
  • elasticdsl(0.0.3.9500)
  • FeatureHashing(0.9.1.1)
  • hash(2.2.6)
  • httr(1.0.0)
  • jsonlite(0.9.17)
  • Matrix(1.2-2)
  • rJava(0.9-7)
  • RWeka(0.4-24)
  • stringr(1.0.0)
  • text2vec(0.2.1)
  • tm(0.6-2)

(moreover, Apache Tika, JSOUP and the Stanford NLP library are downloaded automatically by the gradle scripts)
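
Before starting, it can help to verify that the installed toolchain matches the versions above; a rough check (the ElasticSearch URL assumes a default local installation on port 9200):

java -version                                       # expect 1.7.x (1.8.x for the NVAO transformation)
gradle -v                                           # expect 2.11
python --version                                    # expect 2.7.x
python -c 'import nltk; print(nltk.__version__)'    # expect 3.1
curl -s http://localhost:9200                       # version.number should read 1.7.3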

data preparation

  • execute the download scripts from the scripts folder (a consolidated sketch of the scripted steps follows this list)

  • download and install WikiExtractor and transform the dumps downloaded from Wikipedia, e.g.

    WikiExtractor.py -o tmp/simplewiki -b 100G -s -ns Article --no-templates tmp/simplewiki-20151020-pages-articles.xml.bz2

  • execute transformation scripts as described in java/transformation

  • use quizlet.py to retrieve the quizlet data (see also the Readme in the python folder)

  • transform the quizlet data by executing R --no-save < R/utils/importQuizlet.R

  • execute the NVAO processing on the kaggle input files in R/input, as described in java/nvao
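
Taken together, the scripted parts of the preparation look roughly as follows. This is only a sketch: the download-script names and the quizlet.py invocation are assumptions (check the scripts and python folders), and the transformation and NVAO steps still follow their respective Readmes.

for s in scripts/download*.sh; do bash "$s"; done      # assumed naming of the download scripts
WikiExtractor.py -o tmp/simplewiki -b 100G -s -ns Article --no-templates tmp/simplewiki-20151020-pages-articles.xml.bz2
python python/quizlet.py                               # arguments as documented in the python folder's Readme
R --no-save < R/utils/importQuizlet.R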

ElasticSearch index creation

  • execute the lines in es/Readme.md
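
Before creating the indices, it is worth confirming that the ElasticSearch 1.7.3 node is reachable and healthy (assuming the default local installation):

curl -s 'http://localhost:9200/_cluster/health?pretty'    # status should be green or yellow
curl -s 'http://localhost:9200/_cat/indices?v'            # lists the indices created so far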

R pipeline

  • build the jar in java/stemming according to its Readme.md and copy the result from build/libs into R/lib (a combined sketch follows this list)

  • run the R script

    cd R && R --no-save < runFullModelPipeline.R
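
Assuming the stemming jar is built with gradle's standard build task (java/stemming/Readme.md is authoritative), the two steps above amount to roughly:

cd java/stemming && gradle build                     # build the stemming jar
cp build/libs/*.jar ../../R/lib/                     # make it available to the R pipeline
cd ../../R && R --no-save < runFullModelPipeline.R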

training

In order to re-train the model, execute

cd R && R --no-save < trainModel.R